# Lab 4: Extending Logistic Regression

### Data Mangling:

Mangling of the CSV:
- We first removed the IMDb link column from the CSV because we knew we would never use it (**Note: this was the only feature removed from the CSV**).
- We then deleted all movies made in another country (foreign films). We wanted to look only at American films, and the budget and gross figures for foreign films are reported in their native currencies rather than USD. With changing exchange rates, those figures are hard to compare across countries.
- We then converted all 0 values for gross, movie_facebook_likes, and director_facebook_likes to blank values in the CSV so that pandas reads them in as NaN, which makes it easier to impute values later. Note: according to the description on the Kaggle entry, some movies had missing data because of the way the data was scraped, and the Python scraper recorded these values as 0 instead of NaN.
- We then removed all movies with an undefined gross. Because gross is the feature we are trying to predict, we should not impute values for it to train our model; doing so would effectively reduce our model to an imputation algorithm.
- We then removed all movies made before 1935. There were only a handful of movies from 1915 to 1935, and the way we classify budget (described below) would not work with such a small sample from that period. We could have cut at a different year (say 1960), but we did not want to exclude classics such as "Bambi" or "Gone With the Wind".
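The steps above were applied directly to the CSV, but they could also be expressed in pandas. The following is only a sketch on a made-up toy table: the column name `movie_imdb_link`, the helper `clean_movies`, and every value in `raw` are assumptions for illustration, not the actual data or the procedure we ran.

```python
import pandas as pd

# Toy stand-in for movie_metadata.csv; all values are invented for illustration.
raw = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D", "E"],
    "movie_imdb_link": ["http://example.com/x"] * 5,  # assumed column name
    "country": ["USA", "France", "USA", "USA", "USA"],
    "title_year": [1999, 2005, 1920, 2010, 1942],
    "gross": [1000000, 500000, 20000, 0, 3000000],
    "movie_facebook_likes": [0, 10, 0, 250, 0],
    "director_facebook_likes": [12, 0, 0, 0, 40],
})


def clean_movies(df):
    """Hypothetical helper mirroring the five CSV-level cleaning steps."""
    # 1. Drop the link column (ignored if it is already absent).
    out = df.drop(columns=["movie_imdb_link"], errors="ignore")
    # 2. Keep only films made in the USA.
    out = out[out["country"] == "USA"].copy()
    # 3. Treat scraped zeros as missing values (NaN).
    zero_as_missing = ["gross", "movie_facebook_likes", "director_facebook_likes"]
    out[zero_as_missing] = out[zero_as_missing].mask(out[zero_as_missing] == 0)
    # 4. Drop rows whose target (gross) is missing instead of imputing it.
    out = out.dropna(subset=["gross"])
    # 5. Drop films made before 1935.
    out = out[out["title_year"] >= 1935]
    return out.reset_index(drop=True)


cleaned = clean_movies(raw)
print(cleaned)
```

On the toy table, only movies "A" and "E" survive: "B" is foreign, "C" predates 1935, and "D" has no gross.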

Mangling of the Data:
- After the above steps, we made more edits to the data using pandas. First, we removed features that we thought would not be useful to our prediction algorithm, starting with all features concerning Facebook likes. A significant portion of the movies in the training set debuted before Facebook was invented and widely adopted. While some of these movies have received retroactive "likes" on Facebook, only the most famous classics received a substantial number of them. Most lesser-known films received very few "likes" (presumably because modern moviegoers don't search for lesser-known movies on Facebook, or because the movie doesn't have a Facebook page). For this reason we decided to remove movie_facebook_likes.
- Likewise, we removed the other "likes" features for the same reasons. For example, the esteemed director George Lucas has a total of 0 "likes" across all of his films. Such a feature would clearly not help us predict the profitability of movies.
- We also removed irrelevant information such as aspect_ratio, language, and country. Because we deleted all foreign films, the country will always be USA. A simple filter of the data reveals that no more than 20 movies made in the US use a language other than English, so there is not enough data to use language as a training feature. However, we did not delete the movies in a different language, because most of them were famous films such as *Letters from Iwo Jima* and *The Kite Runner*. We still count them as a valuable part of the dataset; we just don't find their language to be of particular value. Lastly, we removed aspect_ratio because it seems unimportant for predicting the success of a movie.
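The "simple filter" on language mentioned above might look like the sketch below. The rows in `movies` are made up; only the column names `country` and `language` come from the dataset.

```python
import pandas as pd

# Invented rows standing in for the US-only data.
movies = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "country": ["USA", "USA", "USA", "USA"],
    "language": ["English", "Japanese", "English", "Dari"],
})

us_movies = movies[movies["country"] == "USA"]
# Note: a missing (NaN) language would also count as non-English here.
non_english = us_movies[us_movies["language"] != "English"]

print(len(non_english))                       # number of non-English US films
print(us_movies["language"].value_counts())   # breakdown by language
```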

In [21]:
import pandas as pd
import numpy as np

df = pd.read_csv("movie_metadata.csv")
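# Drop the Facebook "likes" counts and the columns we judged irrelevant.
# errors='ignore' skips any column that is already absent.
df = df.drop(columns=['movie_facebook_likes', 'director_facebook_likes', 'actor_2_facebook_likes',
                      'actor_1_facebook_likes', 'actor_3_facebook_likes', 'cast_total_facebook_likes',
                      'aspect_ratio', 'language', 'country'],
             errors='ignore')
df.info()  # info() prints its report itself and returns None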

In [28]:
# Group movies by director and IMDb score so that missing values can be
# filled in from similar movies.
# Open question: would different grouping keys let more values be imputed?
df_grouped = df.groupby(by=['director_name','imdb_score'])

In [29]:
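# Fill each missing value with the median of its (director_name, imdb_score) group
df_imputed = df_grouped.transform(lambda grp: grp.fillna(grp.median()))
# Restore any columns missing from the result (the grouping keys, or columns the median op dropped)
col_deleted = list(set(df.columns) - set(df_imputed.columns))
df_imputed[col_deleted] = df[col_deleted]

df_imputed.info()  # info() prints its report itself and returns None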

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3232 entries, 0 to 3231
Data columns (total 18 columns):
num_critic_for_reviews    3229 non-null float64
duration                  3231 non-null float64
gross                     3232 non-null int64
num_voted_users           3232 non-null int64
facenumber_in_poster      3227 non-null float64
num_user_for_reviews      3231 non-null float64
budget                    3085 non-null float64
title_year                3232 non-null int64
color                     3231 non-null object
movie_title               3232 non-null object
genres                    3232 non-null object
plot_keywords             3208 non-null object
actor_2_name              3229 non-null object
director_name             3232 non-null object
imdb_score                3232 non-null float64
content_rating            3206 non-null object
actor_3_name              3225 non-null object
actor_1_name              3230 non-null object
dtypes: float64(6), int64(3), object(9)
memo
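The output above still shows missing values after imputation (for example, budget has 3085 non-null entries out of 3232). One likely reason is that a group median is itself NaN when every value in the group is missing, which happens easily with narrow groups such as a single director at a single score. The toy frame below is invented purely to illustrate that behavior.

```python
import numpy as np
import pandas as pd

# Invented data: director X has some budgets, director Y has none.
toy = pd.DataFrame({
    "director_name": ["X", "X", "X", "Y", "Y"],
    "imdb_score": [7.0, 7.0, 7.0, 6.5, 6.5],
    "budget": [10.0, np.nan, 30.0, np.nan, np.nan],
})

# Same idea as the imputation cell, restricted to one numeric column.
filled = toy.groupby(["director_name", "imdb_score"])["budget"].transform(
    lambda s: s.fillna(s.median())
)

print(filled.tolist())
# The gap in group X is filled with that group's median (20.0);
# group Y has no observed budget, so its values stay NaN.
```

Rows like those in group Y are the ones that the subsequent `dropna` call removes.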

In [30]:
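# Drop the rows that still have missing values after imputation
df_imputed.dropna(inplace=True)
df_imputed.info()  # info() prints its report itself and returns None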

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3042 entries, 0 to 3230
Data columns (total 18 columns):
num_critic_for_reviews    3042 non-null float64
duration                  3042 non-null float64
gross                     3042 non-null int64
num_voted_users           3042 non-null int64
facenumber_in_poster      3042 non-null float64
num_user_for_reviews      3042 non-null float64
budget                    3042 non-null float64
title_year                3042 non-null int64
color                     3042 non-null object
movie_title               3042 non-null object
genres                    3042 non-null object
plot_keywords             3042 non-null object
actor_2_name              3042 non-null object
director_name             3042 non-null object
imdb_score                3042 non-null float64
content_rating            3042 non-null object
actor_3_name              3042 non-null object
actor_1_name              3042 non-null object
dtypes: float64(6), int64(3), object(9)
memo