In [3]:
import csv
import pandas as pd
import numpy as np
import math


In [4]:
## Just importing the data to use

csv_1k_path = 'C:/Users/Shmoops/Desktop/Conda_Stuff/python-prep-backup/fall-2025-predicting-movie-success/modified_data/connor_modified_data/movie_reviews_1k.csv'

csv_all_path = 'C:/Users/Shmoops/Desktop/Conda_Stuff/python-prep-backup/fall-2025-predicting-movie-success/modified_data/connor_modified_data/movie_reviews_all.csv'

df_1k = pd.read_csv(csv_1k_path)
df_all = pd.read_csv(csv_all_path)

row_1k = [x for x in df_1k['moviename']]
row_all = [x for x in df_all['moviename']]

df_1k.index = row_1k
df_all.index = row_all


The next couple of steps centralize the data into one data frame to make the analysis more straightforward. I can't think of a better way, so I'm building the data frame column by column. To make this work, I added row labels to each of the source data frames so that specific locations are easier to reference.

In [5]:
dict_data = {}

# Start with the 1k data so we use its formatting, then
# make this a DataFrame since that'll be easier to manipulate

dict_data['moviename'] = df_1k['moviename']
dict_data['one_k_mean'] = df_1k['mean']     #Initial Formation

row_labels = [x for x in df_1k['moviename']]

df_data = pd.DataFrame(dict_data)
df_data.index = row_labels                  #Fully labeled now

## One issue right now is that the datasets differ in which movies they
## include, so we exclude the mismatched titles to make life easier. For the
## overarching goal of identifying the effect of award wins on reviews, a
## data point or two shouldn't change much. We can also go back and handle
## them separately later.

# Drop both titles in one call; calling df_data.drop twice in a row and
# reassigning would keep only the second drop.
df_all_stats = df_data.drop(['the-trial-of-the-chicago-7', 'the-favourite'], axis=0)

all_mean_correct_format = []

for movie_title in df_all_stats['moviename']:    #Now reordering to add the all_means
    all_mean_correct_format.append(df_all.loc[movie_title,'mean'])

df_all_stats['all_mean'] = all_mean_correct_format  #Now all the averages are in one dataframe

df_all_stats


Unnamed: 0,moviename,one_k_mean,all_mean
boyhood,boyhood,4.44,3.76
black-panther,black-panther,3.88,3.75
cold-war-2018,cold-war-2018,4.06,3.97
women-talking,women-talking,3.97,3.76
oppenheimer-2023,oppenheimer-2023,4.49,4.16
...,...,...,...
nomadland,nomadland,4.14,3.75
the-fabelmans,the-fabelmans,4.29,3.96
promising-young-woman,promising-young-woman,3.83,3.69
dont-look-up-2021,dont-look-up-2021,3.26,3.09
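
As a possible alternative to building the frame column by column, the same combined table can be sketched with a merge on `moviename`. This is only an illustration on made-up data: `toy_1k`, `toy_all`, and their values are hypothetical stand-ins for `df_1k` and `df_all`, not the real CSV contents.

```python
import pandas as pd

# Hypothetical stand-ins for df_1k and df_all (values are made up).
toy_1k = pd.DataFrame({
    "moviename": ["movie-a", "movie-b", "movie-c"],
    "mean": [4.0, 3.5, 4.2],
})
toy_all = pd.DataFrame({
    "moviename": ["movie-b", "movie-a", "movie-d"],
    "mean": [3.4, 3.8, 3.0],
})

# An inner merge keeps only titles present in both frames, so movies that
# appear in just one dataset are excluded without naming them by hand.
toy_stats = toy_1k.merge(
    toy_all, on="moviename", how="inner", suffixes=("_1k", "_all")
)
toy_stats = toy_stats.rename(
    columns={"mean_1k": "one_k_mean", "mean_all": "all_mean"}
)
print(toy_stats)
```

With this approach the row order of the second frame does not matter, since rows are matched by title rather than by position.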


The big goal now is to put all of the other necessary statistics in df_all_stats as well. We want to 
quantify the effect of winning awards, so we'll also add the standard deviations from both 
datasets.

In [6]:
all_std_dev = []
one_k_std_dev = []

for x in df_all_stats['moviename']:
    all_std_dev.append(df_all.loc[x,'std_dev'])
    one_k_std_dev.append(df_1k.loc[x,'std_dev'])

df_all_stats['one_k_std_dev'] = one_k_std_dev
df_all_stats['all_std_dev'] = all_std_dev

df_all_stats

Unnamed: 0,moviename,one_k_mean,all_mean,one_k_std_dev,all_std_dev
boyhood,boyhood,4.44,3.76,0.61,0.93
black-panther,black-panther,3.88,3.75,0.77,0.88
cold-war-2018,cold-war-2018,4.06,3.97,0.74,0.77
women-talking,women-talking,3.97,3.76,0.80,0.84
oppenheimer-2023,oppenheimer-2023,4.49,4.16,0.83,0.81
...,...,...,...,...,...
nomadland,nomadland,4.14,3.75,0.73,0.87
the-fabelmans,the-fabelmans,4.29,3.96,0.70,0.77
promising-young-woman,promising-young-woman,3.83,3.69,0.91,0.94
dont-look-up-2021,dont-look-up-2021,3.26,3.09,1.07,0.98


Now we have everything in one place and we can start looking at the changes across a number of metrics

In [7]:
dict_changes = {}

dict_changes['moviename'] = df_all_stats['moviename']
change_labels = [x for x in df_all_stats['moviename']]



df_changes = pd.DataFrame(dict_changes, index=change_labels)

# print(change_labels)
# df_changes

mean_changes = [df_all_stats.loc[x, 'one_k_mean'] - df_all_stats.loc[x,'all_mean'] for x in change_labels]
df_changes['mean_change'] = mean_changes

From here the methodology is as follows:

We'll assume that the first one thousand reviews are more representative of the 'true' review distribution. As such, any changes for the complete list of reviews will be measured against that. 
Specifically, we want some metric d('change in mean', 'original standard deviation') to 
quantify how large a change has occurred. For instance, suppose movie A has the data

A: one_k_mean = 3, one_k_std_dev = 0.7, all_mean = 4.5

and movie B has the data

B: one_k_mean = 3, one_k_std_dev = 0.5, all_mean = 4.5

Then the change in B is more significant in some sense, because the change in the mean is larger relative to the standard deviation.
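
To make the A versus B comparison concrete, here is a small sketch that expresses each shift in units of the first-1k standard deviation. The helper `standardized_shift` is hypothetical and is just one way to read "relative to the standard deviation"; it is an assumption for illustration, not the metric this notebook settles on.

```python
# Hypothetical helper (not part of the notebook): the size of the mean shift
# measured in units of the first-1k standard deviation.
def standardized_shift(one_k_mean, one_k_std_dev, all_mean):
    return abs(all_mean - one_k_mean) / one_k_std_dev

# Movies A and B from the example above.
shift_a = standardized_shift(3, 0.7, 4.5)
shift_b = standardized_shift(3, 0.5, 4.5)

# B's shift spans more standard deviations than A's.
print(round(shift_a, 2), round(shift_b, 2))
```

Both movies move by 1.5 points, but B's tighter spread makes that same move cover more standard deviations.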

For now, we need to add the standard deviations to the data frame as well.

In [11]:
one_k_std_dev2 = []

for x in df_changes['moviename']:
    one_k_std_dev2.append(df_1k.loc[x,'std_dev'])

df_changes['one_k_std_dev'] = one_k_std_dev2
df_changes

Unnamed: 0,moviename,mean_change,one_k_std_dev,change_vectors_size
boyhood,boyhood,0.68,0.61,0.652
black-panther,black-panther,0.13,0.77,0.386
cold-war-2018,cold-war-2018,0.09,0.74,0.350
women-talking,women-talking,0.21,0.80,0.446
oppenheimer-2023,oppenheimer-2023,0.33,0.83,0.530
...,...,...,...,...
nomadland,nomadland,0.39,0.73,0.526
the-fabelmans,the-fabelmans,0.33,0.70,0.478
promising-young-woman,promising-young-woman,0.14,0.91,0.448
dont-look-up-2021,dont-look-up-2021,0.17,1.07,0.530


Ideally we want a function d that measures the size of these changes and satisfies the following properties. 

Regarding a movie as a vector A = (C_A, S_A), where C_A = change in mean for movie A and 
S_A = standard deviation for A, if we have two movies A and B we want 

1.) If C_A = C_B then d(A) >= d(B) iff S_A >= S_B

2.) If S_A = S_B then d(A) >= d(B) iff C_A >= C_B

3.) If C_A = S_A = 0 then d(A) = 0

These three properties suggest some sort of linearity so that 

d(A) = w_C * C_A + w_S * S_A
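
As a quick sanity check, the sketch below spot-checks the three properties for a weighted linear d on a few sample points. The function name `d`, the default weights, and the sample values are all assumptions chosen for illustration; this tests a handful of cases and does not prove the properties in general.

```python
# Hypothetical weighted linear form; the weights are assumed to be positive.
def d(C, S, w_C=0.5, w_S=0.5):
    return w_C * C + w_S * S

# Property 1: equal change in mean, larger standard deviation -> larger d.
prop1 = d(0.3, 0.9) >= d(0.3, 0.6)

# Property 2: equal standard deviation, larger change in mean -> larger d.
prop2 = d(0.5, 0.7) >= d(0.2, 0.7)

# Property 3: the zero vector maps to zero.
prop3 = d(0, 0) == 0

# The same orderings hold for another positive choice of weights.
prop1_other = d(0.3, 0.9, w_C=0.2, w_S=0.8) >= d(0.3, 0.6, w_C=0.2, w_S=0.8)

print(prop1, prop2, prop3, prop1_other)
```

Any strictly positive pair of weights preserves these orderings, but the value of d for a given movie still changes with the weights.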

However, the issue at the moment is the dependence on our choice of weights (w_C, w_S), since different weights can give different results. To get rid of this dependence, we'll use all possible weights and instead consider d as a function d(A, w_C, w_S). Hence we really have the 
assignment A -> d_A(w_C, w_S). To get back to a single number, we then consider the total size of the graph (the integral of d_A) over some 
predetermined domain, which in this case we choose to be (w_C, w_S) \in [0,1] x [0,1]. In this light, what we really want is the association

$$
A \mapsto \int_{[0,1]^2} d_A(w_C, w_S) \, dw_C \, dw_S
= \int_0^1 \int_0^1 \left( C_A w_C + S_A w_S \right) dw_C \, dw_S
= \frac{1}{2} \left( C_A + S_A \right)
$$

so we just arrive back at the case of w_C = 1/2, w_S = 1/2
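
The closed form can be checked numerically. The sketch below approximates the double integral with a midpoint grid over the unit square; the grid size `n` and the sample values `C_A = 1.5`, `S_A = 0.7` (taken from the hypothetical movie A earlier) are assumptions for illustration.

```python
import numpy as np

def d(C, S, w_C, w_S):
    return w_C * C + w_S * S

# Sample values from the hypothetical movie A: |change in mean| and std dev.
C_A, S_A = 1.5, 0.7

# Midpoint grid on [0,1] x [0,1]; n is an arbitrary resolution.
n = 200
w = (np.arange(n) + 0.5) / n
W_C, W_S = np.meshgrid(w, w)

# The domain has area 1, so the mean over the grid approximates the integral.
integral = d(C_A, S_A, W_C, W_S).mean()
closed_form = 0.5 * (C_A + S_A)

print(integral, closed_form)
```

Because the integrand is linear in the weights, the midpoint rule should reproduce the closed form up to floating-point error.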


In [18]:
def dist_func(C,S):
    return 0.5*C + 0.5*S

change_vectors_size = [dist_func(df_changes.loc[x,'mean_change'], df_changes.loc[x,'one_k_std_dev']) for x in df_changes['moviename']]

df_changes['change_vectors_size'] = change_vectors_size
df_changes_sorted = df_changes.sort_values(by = 'change_vectors_size') #Just to make the analysis easier
df_changes_sorted.to_csv('review_changes.csv')