<a href="https://colab.research.google.com/github/SJulapalli/2018-19-Introduction-to-Computer-Science-Projects/blob/main/FinalProject(2)(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Final Project

For the following project, you will be working with a movie dataset. The dataset is [here](https://drive.google.com/file/d/1R53inu8Jcb9GGoyiuVnBMVnO7XvCaJAE/view?usp=drive_link). The dataset columns are as follows:

* Title: The movie's title
* Genre: The movie's genre
* Stars: The number of famous actors in the movie
* Runtime: The length of the movie's runtime
* Budget: How much was spent on filming the movie (in millions)
* Promo: How much money was spent promoting the movie (in millions)
* Season: The season in which the movie was released
* Rating: The movie's rating
* R1: Reviewer 1's review
* R2: Reviewer 2's review
* R3: Reviewer 3's review

And the target variable:

* Success: Whether the film was a success or a flop

Fill in the answers to questions in the text field, and show your code below.

# Data loading

Load the data

In [18]:
!pip install vaderSentiment
import pandas as pd
import scipy as sp
import scipy.stats  # load the stats submodule explicitly so sp.stats is always available
import numpy as np
import plotly.express as px
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from IPython.display import HTML

df = pd.read_csv('CMSC320FinalProjectData(1).csv')



# Data Cleaning

List the three biggest data errors below, with a summary of how you fixed them and why you chose that method:


*
*
*

Problems Identified

Outliers: There are 25 rows with budgets exceeding \$500, all of which have a budget exceeding \$2.5 million. These outliers significantly skew the dataset's summary statistics even though they make up only about 4% of the rows. No standardization method works well here, because any rescaling that preserves the relationship between the budgets leaves the distribution just as skewed. As such, I've opted to drop these rows.

Runtime < 40 minutes: There are 17 rows where the movie's runtime is equal to 0. This is a problem on two fronts: a movie cannot have a runtime of 0 minutes, and international film associations generally set the minimum length of a feature film at 40-60 minutes. These rows make up ~3% of the data, which brings the total share of rows dropped to ~7%, so they were dropped.

Stars == 100: There are 2 rows where the movie has a Stars value of 100. All other movies have a value between 0 and 5, so these rows must either use a different scale from the other entries or contain a mistaken/corrupted entry. These rows were dropped, bringing the total number of rows dropped to ~7.5% of the dataset. Though this is slightly more than wanted, there was no other effective method of managing this problem.

In [25]:
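# Drop by index label rather than by the values of the 'Unnamed: 0' column,
# which only works while that column happens to mirror the index.
df = df.drop(df[df['Budget'] > 500].index)

df = df.drop(df[df['Runtime'] == 0].index)

df = df.drop(df[df['Stars'] == 100].index)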
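
The three rules can also be written as boolean masks, which makes it easy to compute the share of rows dropped directly from the data. The following is a minimal sketch on a made-up `toy` frame (the values and the names `toy`, `to_drop`, and `share_dropped` are illustrative assumptions, not taken from the project data):

```python
import pandas as pd

# Toy stand-in for the movie data; the values are made up for illustration.
toy = pd.DataFrame({
    "Budget": [46.7, 36.4, 6.7e7, 93.2, 92.1, 64.9],
    "Runtime": [131, 132, 126, 0, 119, 128],
    "Stars": [0, 4, 1, 1, 100, 3],
})

# One boolean mask per problem, mirroring the three rules above.
bad_budget = toy["Budget"] > 500
bad_runtime = toy["Runtime"] == 0
bad_stars = toy["Stars"] == 100

# A row is dropped if it breaks any of the rules.
to_drop = bad_budget | bad_runtime | bad_stars
cleaned = toy[~to_drop]

# Share of rows removed, computed from the data instead of by hand.
share_dropped = to_drop.mean()
print(f"dropped {to_drop.sum()} of {len(toy)} rows ({share_dropped:.0%})")
```

Filtering with a mask keeps the original index labels, so later label-based lookups still refer to the same rows.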

# Data Exploration






Does Season have a statistically significant impact on a movie's success?

**p-value: 0.001990401880665208**

Null Hypothesis: There is no relationship between Season and Success (the two are independent).

Alternative Hypothesis: Season has a statistically significant relationship with a movie's success.

Alpha = .05

Since the p-value is below alpha, we reject the null hypothesis and conclude that there is a statistically significant association between Season and a movie's success.

In [26]:
contingency_table = pd.crosstab(df['Season'], df['Success'])

sp.stats.chi2_contingency(contingency_table).pvalue

0.001990401880665208
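
To make the chi-square test of independence less of a black box, the sketch below computes the statistic by hand and compares it with `scipy.stats.chi2_contingency`. The counts in `observed` are hypothetical and are not the project's contingency table:

```python
import numpy as np
from scipy import stats

# Hypothetical Season x Success counts (rows: seasons, columns: flop, success).
observed = np.array([
    [90, 20],
    [70, 45],
    [85, 25],
    [95, 30],
])

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic and its degrees of freedom.
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_stat, dof)

# The library routine should agree with the manual calculation.
chi2_lib, p_lib, dof_lib, expected_lib = stats.chi2_contingency(observed)
print(p_manual, p_lib)
```

The two p-values match here because SciPy's continuity correction only applies to tables with one degree of freedom.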

Do seasons have a statistically significant difference in their distribution of content ratings?

**p-value: 0.4064940513498879**

Null Hypothesis: The mean rating (measured with the Stars column) is the same across all seasons.

Alternative Hypothesis: The mean rating differs for at least one season.

Alpha = .05

Since the p-value is above alpha, we fail to reject the null hypothesis: there is no statistically significant evidence that the mean rating differs across seasons.

In [27]:
spring_ratings = df[df['Season'] == 'Spring']['Stars']
summer_ratings = df[df['Season'] == 'Summer']['Stars']
fall_ratings = df[df['Season'] == 'Fall']['Stars']
winter_ratings = df[df['Season'] == 'Winter']['Stars']

sp.stats.f_oneway(spring_ratings, summer_ratings, fall_ratings, winter_ratings).pvalue

0.4064940513498879
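
The one-way ANOVA above compares the numeric Stars column. If "content ratings" is instead read as the categorical Rating column (PG, PG13, R), a chi-square test of independence on a Season by Rating contingency table is one possible alternative. A minimal sketch on made-up data, where `toy` stands in for the cleaned `df`:

```python
import pandas as pd
from scipy import stats

# Made-up rows for illustration; the real notebook would use df instead of toy.
toy = pd.DataFrame({
    "Season": ["Spring", "Summer", "Fall", "Winter"] * 6,
    "Rating": ["PG", "PG13", "R", "PG", "R", "PG13"] * 4,
})

# Count how often each content rating occurs in each season.
table = pd.crosstab(toy["Season"], toy["Rating"])

# Unpacking works across SciPy versions (statistic, p-value, dof, expected counts).
chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
print(table)
print(f"p-value: {p_value:.4f}")
```

This tests whether the distribution of ratings differs by season, whereas ANOVA only compares means of a numeric variable.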

Who is the harshest critic (highest percent of negative reviews)?

**Critic: R1**

In [28]:
sentiment = SentimentIntensityAnalyzer()

def percent_negative(reviews):
  # A review counts as negative when its VADER compound score is below 0.
  num_neg = sum(sentiment.polarity_scores(text)['compound'] < 0 for text in reviews)
  # Truncated percentage of the rows that remain after cleaning (no hard-coded row count).
  return int(100 * num_neg / len(reviews))

print(f'R1: ~{percent_negative(df["R1"])}%', f'R2: ~{percent_negative(df["R2"])}%', f'R3: ~{percent_negative(df["R3"])}%')

R1: ~49% R2: ~37% R3: ~30%
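
The count above treats any compound score below 0 as negative. VADER's documentation describes -0.05 and +0.05 as typical cutoffs, with scores in between treated as neutral, so the percentages can shift with the threshold chosen. A small sketch using made-up compound scores (the helper `percent_below` is hypothetical and not part of vaderSentiment):

```python
def percent_below(scores, threshold):
    """Return the percentage of scores strictly below the threshold."""
    below = sum(1 for score in scores if score < threshold)
    return 100 * below / len(scores)

# Made-up compound scores in [-1, 1], standing in for one critic's reviews.
scores = [-0.62, -0.03, 0.0, 0.41, -0.04, 0.77, -0.30, 0.02]

notebook_rule = percent_below(scores, 0.0)    # the rule used in the notebook
vader_cutoff = percent_below(scores, -0.05)   # a stricter cutoff near VADER's suggestion
print(notebook_rule, vader_cutoff)
```

Mildly negative scores such as -0.03 count as negative under the first rule but not the second, which is worth checking before ranking critics whose percentages are close.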



What is the covariance between promotional budget and the filming budget?

**Cov: 1829.9233034597808**

In [29]:
# Sample covariance computed by hand, using the actual row count instead of a hard-coded 520.
n = len(df)
avg_promotional = df['Promo'].sum() / n
avg_filming = df['Budget'].sum() / n

# Sum of the products of each row's deviations from the two means.
sum_of_deviations = ((df['Promo'] - avg_promotional) * (df['Budget'] - avg_filming)).sum()

# Divide by n - 1 (Bessel's correction) to get the sample covariance.
covariance = sum_of_deviations / (n - 1)

covariance

1829.9233034597808
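
A hand-computed sample covariance can be cross-checked against `Series.cov` in pandas and `numpy.cov`, both of which divide by n - 1 by default. The sketch below does this on a small made-up frame; `toy` and its values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Made-up budgets in millions, for illustration only.
toy = pd.DataFrame({
    "Promo": [10.0, 20.0, 30.0, 40.0],
    "Budget": [15.0, 25.0, 45.0, 55.0],
})

# Manual sample covariance: sum of deviation products over n - 1.
n = len(toy)
dev_promo = toy["Promo"] - toy["Promo"].mean()
dev_budget = toy["Budget"] - toy["Budget"].mean()
manual_cov = (dev_promo * dev_budget).sum() / (n - 1)

# Library equivalents.
pandas_cov = toy["Promo"].cov(toy["Budget"])
numpy_cov = np.cov(toy["Promo"], toy["Budget"])[0, 1]
print(manual_cov, pandas_cov, numpy_cov)
```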

# Data Visualization

Create a chart that compares the distribution of the budget for each different number of stars. (It does not need to be particularly appealing.)

In [31]:
# One violin per Stars value shows the budget distribution, with a box plot and the raw points overlaid.
fig = px.violin(df, x='Stars', y='Budget', color='Stars', box=True, points="all")
fig.show()

Create a graph showing the average movie budget over time.

In [32]:
# Select Budget before averaging so non-numeric columns are never passed to mean().
average_budgets = df.groupby('Year')['Budget'].mean()

fig = px.line(average_budgets, title='Average Movie Budget Over Time')
fig.show()



# Feature Engineering

List any features you choose to create (if you are creating many features based on one column, you do not need to list them separately.) You are not required to create any features if you do not wish to. You may create any number of additional features.


*
*

Features Added:

 1) One-hot encoded Season, Rating, and Genre.

 2) Added average compound sentiment column

 3) Dropped Unnamed: 0, Title, Success (after splitting into its own column for testing) and R1-3

In [None]:
fe_df = pd.get_dummies(df, columns=['Season', 'Rating', 'Genre'])

def compound(text):
  # VADER's compound score summarizes a review's sentiment in [-1, 1].
  return sentiment.polarity_scores(text)['compound']

# Average the compound score over the three reviews of each movie.
fe_df['Average_Sentiment'] = (df['R1'].apply(compound) + df['R2'].apply(compound) + df['R3'].apply(compound)) / 3

fe_df = fe_df.drop(columns=['Unnamed: 0', 'Title', 'R1', 'R2', 'R3'])

fe_df

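| | Runtime | Stars | Year | Budget | Promo | Success | Season_Fall | Season_Spring | Season_Summer | Season_Winter | Rating_PG | Rating_PG13 | Rating_R | Genre_Action | Genre_Drama | Genre_Fantasy | Genre_Romantic Comedy | Genre_Science fiction | Average_Sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 126 | 1 | 2020 | 6.679387e+07 | 73.543754 | False | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.055367 |
| 1 | 131 | 0 | 2020 | 4.667863e+01 | 33.572003 | False | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.011600 |
| 2 | 132 | 4 | 2000 | 3.639134e+01 | 54.561523 | False | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.263533 |
| 3 | 132 | 1 | 2015 | 9.324732e+01 | 59.714535 | False | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | -0.017667 |
| 4 | 119 | 1 | 2015 | 9.213021e+01 | 67.643810 | False | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0.110367 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 535 | 128 | 3 | 2021 | 6.489702e+01 | 91.445593 | False | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | -0.289567 |
| 536 | 123 | 1 | 2018 | 3.098935e+01 | 46.045408 | True | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0.000000 |
| 537 | 121 | 1 | 2003 | 4.857255e+01 | 63.660912 | False | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | -0.400800 |
| 538 | 124 | 1 | 2007 | 1.364682e+02 | 188.513344 | True | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0.209933 |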


# Modeling

Create a model of your choice.

**Model type chosen: Random forest classifier**

In [None]:
clf = RandomForestClassifier(max_depth=3, random_state=1)

# Testing

Shuffle your data and break it into a 10% test set and 90% training set. Show your model's accuracy on the test set. In order to get full credit, the model's accuracy must be higher than 50%.

**Model accuracy: 0.8148148148148148**

In [None]:
train, test = train_test_split(fe_df, test_size=0.1)

success_train = train['Success']
success_test = test['Success']

train = train.drop(columns=['Success'])
test = test.drop(columns=['Success'])

clf.fit(train, success_train)

clf.score(test, success_test)

0.8148148148148148
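
Because `train_test_split` is called without a `random_state`, each run draws a different test set and the accuracy varies from run to run. The sketch below shows one way to make the split repeatable; it uses synthetic data from `make_classification` as a stand-in for `fe_df`, and the `evaluate` helper and the seed value are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and the Success column.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

def evaluate(seed):
    # Fixing random_state makes the shuffle, and therefore the score, repeatable.
    # stratify=y keeps the class balance similar in the train and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(max_depth=3, random_state=1)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

first = evaluate(seed=42)
second = evaluate(seed=42)
print(first, second)
```

With a test set of only a few dozen rows, a single misclassified movie moves the accuracy by about two percentage points, so a fixed seed also makes reported numbers easier to compare.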

Show the confusion matrix for your model. To get full credit, the percent of false negatives and the percent of false positives must be under 30%. (Divide false negatives by total, and divide false positives by total, and make sure both numbers are under 30%).


**False negative rate: 0.18518518518518517**

**False positive rate: 0.0**

In [None]:
predicted_test = clf.predict(test)
print(confusion_matrix(success_test, predicted_test))
tn, fp, fn, tp = confusion_matrix(success_test, predicted_test).ravel()
total = tn + fp + fn + tp

# Each count is divided by the total number of test rows, as the question asks.
print('True Positive Rate: ', tp / total)
print('True Negative Rate: ', tn / total)
print('False Positive Rate: ', fp / total)
print('False Negative Rate: ', fn / total)

[[43  0]
 [10  1]]
True Positive Rate:  0.018518518518518517
True Negative Rate:  0.7962962962962963
False Positive Rate:  0.0
False Negative Rate:  0.18518518518518517
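
The figures above divide each cell by the total number of test rows, as the question requests. The conventional false positive and false negative rates instead condition on the actual class. The sketch below recomputes both views from the confusion matrix printed above; the variable names are illustrative:

```python
import numpy as np

# Confusion matrix from the test run above: rows are actual, columns are predicted.
cm = np.array([[43, 0],
               [10, 1]])
tn, fp, fn, tp = cm.ravel()
total = cm.sum()

accuracy = (tp + tn) / total

# Shares of the whole test set, which is what the question asks for.
fp_share = fp / total
fn_share = fn / total

# Conventional rates condition on the actual class instead.
false_positive_rate = fp / (fp + tn)   # flops predicted as successes
false_negative_rate = fn / (fn + tp)   # successes predicted as flops

print(accuracy, fn_share, false_negative_rate)
```

On this matrix the share of false negatives is under 30% of the test set, yet 10 of the 11 actual successes are predicted as flops, which the accuracy alone does not reveal.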


What was the most important feature for your model? Don't guess, either look up how to check or do your own tests.

**Most important feature: Average_Sentiment**


In [None]:
feature_importances = clf.feature_importances_

# argmax returns the position of the largest importance; map it back to the column name.
col = train.columns[feature_importances.argmax()]

col

'Average_Sentiment'
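
Looking at the full ranking, rather than only the top entry, shows how far ahead the leading feature is. A minimal sketch, assuming synthetic data in which a single column (`f2`, a made-up name) determines the label:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data in which only column f2 determines the label.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=[f"f{i}" for i in range(5)])
y = X["f2"] > 0

model = RandomForestClassifier(max_depth=3, random_state=1).fit(X, y)

# Pair each importance with its column name and rank from most to least important.
ranking = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
top_feature = ranking.index[0]
print(ranking)
```

The same pattern should apply to the notebook's model by replacing `model` with `clf` and `X.columns` with `train.columns`. Impurity-based importances are computed on the training data, so they are best read as a rough guide.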