## Movie Revenue Prediction 

### Objective: Your client is a movie studio and they need to be able to predict movie revenue in order to greenlight the project and assign a budget to it. 
- Most of the data is comprised of categorical variables. 
- While the budget for the movie is known in the dataset, it is often an unknown variable during the greenlighting process. 

In [1]:
# Import necessary modules
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import datetime

### Section 1: Preprocessing and Exploratory Analysis
- Load Movie_Revenue_Predictions.csv data
- Visualize Data (EDA)

In [2]:
# Load the dataset
movie_df = pd.read_csv('Movie_Revenue_Predictions.csv')
movie_df.head(2)

Unnamed: 0,title,tagline,revenue,budget,genres,homepage,id,keywords,original_language,overview,production_companies,production_countries,release_date,runtime,spoken_languages,status
0,Avatar,Enter the World of Pandora.,2787965087,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,"In the 22nd century, a paraplegic Marine is di...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",12/10/09,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released
1,Pirates of the Caribbean: At World's End,"At the end of the world, the adventure begins.",961000000,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,"Captain Barbossa, long believed to be dead, ha...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",5/19/07,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released


In [3]:
# Assess shape of data
movie_df.shape

(4803, 16)

In [4]:
# Assess dataframe
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 16 columns):
title                   4803 non-null object
tagline                 3959 non-null object
revenue                 4803 non-null int64
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
overview                4800 non-null object
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
dtypes: float64(1), int64(3), object(12)
memory usage: 600.5+ KB


In [5]:
movie_df.isnull().sum()

title                      0
tagline                  844
revenue                    0
budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
overview                   3
production_companies       0
production_countries       0
release_date               1
runtime                    2
spoken_languages           0
status                     0
dtype: int64

**Observations**: Five columns contain missing (NaN) values: tagline, homepage, overview, release_date, and runtime.

## A. Resolution of NaN variables
#### 1. Numerical columns:
   - **runtime** (2 NaN)
       - since there are so few instances, dropping the NaN rows will not interfere with the analysis of movie revenue. 
        
#### 2. Categorical/Object columns:
Filling in NaN categorical values in the remaining columns is a bit tricky since there is no easily-applied statistical method.
   - **homepage** (3091 NaN)
        - roughly 2/3 of the data is missing (3091 of the total 4803, about 64%), so this column cannot be effectively utilized for this model and should be dropped.
   - **overview** (3 NaN)
        - since there are so few instances, dropping the NaN rows will not interfere with the analysis of movie revenue. 
   - **release_date** (1 NaN)
        - since there are so few instances, dropping the NaN rows will not interfere with the analysis of movie revenue. 
   - **tagline** (844 NaN)
        - while 844 is well under 1/4 of the total data (about 18%), the tagline for a movie may in fact have a significant impact on movie revenue due to its marketing implications. Therefore, I will attempt to predict the missing tagline values with a random forest, as documented in this source article: https://www.mikulskibartosz.name/fill-missing-values-using-random-forest/. Since this requires converting the dataset into numerical/encoded values, this step will be reserved until all other values have been resolved.
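
The drop-or-keep reasoning above comes down to the share of missing values in each column. A minimal sketch of that arithmetic, using the counts from the `isnull().sum()` output; the 50% cutoff and the `missing_share` helper are illustrative assumptions, not part of the original analysis:

```python
# Missing-value counts copied from the isnull().sum() output above
TOTAL_ROWS = 4803
missing_counts = {
    "tagline": 844,
    "homepage": 3091,
    "overview": 3,
    "release_date": 1,
    "runtime": 2,
}

# Hypothetical cutoff: columns missing more than half their values are dropped
DROP_THRESHOLD = 0.5


def missing_share(count, total=TOTAL_ROWS):
    """Return the fraction of rows with a missing value."""
    return count / total


shares = {col: missing_share(n) for col, n in missing_counts.items()}
columns_to_drop = [col for col, share in shares.items() if share > DROP_THRESHOLD]

for col, share in shares.items():
    print(f"{col:<13} {share:6.1%}")
print("Drop:", columns_to_drop)
```

Under this assumed cutoff only homepage is flagged for removal, which matches the decision described in the list.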
        
## B. Feature Engineering of Values of Columns with Nested Lists

#### 1. Categorical Columns with Nested (possibly JSON) Data:
   - **genres**
   - **keywords**
   - **production_companies**
   - **production_countries**
   - **spoken_languages**
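
Each of these columns stores a JSON array of objects as a string. As a rough sketch of what unpacking one cell involves, the snippet below parses the genres value for Avatar (copied from the printed output further down) with the standard `json` module. `extract_names` is a hypothetical helper introduced here for illustration; the notebook itself uses `ast.literal_eval`.

```python
import json

# Raw cell value for Avatar's genres, as stored in the CSV
raw_genres = (
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, '
    '{"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'
)


def extract_names(raw):
    """Parse a JSON array of objects and return the 'name' of each entry."""
    return [entry["name"] for entry in json.loads(raw)]


genre_names = extract_names(raw_genres)
print(genre_names)
```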

### A. Resolve NaNs, minus applying Random Forest for *tagline* column predictions.

In [6]:
# homepage
print('Homepage: ', movie_df.homepage[0])

# overview
print('Overview: ', movie_df.overview[0])

# release_date
print('Release: ', movie_df.release_date[0], isinstance(movie_df.release_date[0], datetime.date))

# runtime
print('Runtime: ', movie_df.runtime[0])

Homepage:  http://www.avatarmovie.com/
Overview:  In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.
Release:  12/10/09 False
Runtime:  162.0
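
The `False` in the output above shows that release_date is stored as a string rather than a date. A minimal sketch of one way to parse it, assuming every value follows the month/day/two-digit-year layout seen in the sample rows; the variable names are illustrative and the notebook does not perform this step:

```python
import pandas as pd

# Sample release_date strings in the dataset's month/day/two-digit-year layout
release_dates = pd.Series(["12/10/09", "5/19/07", "3/30/99"])

# Parse with an explicit format; %y maps 69-99 to 19xx and 00-68 to 20xx
parsed = pd.to_datetime(release_dates, format="%m/%d/%y")

release_year = parsed.dt.year
release_month = parsed.dt.month
print(pd.DataFrame({"date": parsed, "year": release_year, "month": release_month}))
```

Because of the two-digit year convention, any film released before 1969 would be parsed into the 2000s and would need to be corrected separately.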


1. Drop rows with NaNs in the runtime, overview, and release_date columns

In [7]:
# only select rows where overview, runtime, and release_date columns are "not null"
movie_df = movie_df.dropna(subset=['runtime', 'overview', 'release_date'])
len(movie_df)

4799

In [8]:
# Drop unnecessary columns
movie_df = movie_df.drop(columns=['homepage'])
movie_df.head(2)

Unnamed: 0,title,tagline,revenue,budget,genres,id,keywords,original_language,overview,production_companies,production_countries,release_date,runtime,spoken_languages,status
0,Avatar,Enter the World of Pandora.,2787965087,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,"In the 22nd century, a paraplegic Marine is di...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",12/10/09,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released
1,Pirates of the Caribbean: At World's End,"At the end of the world, the adventure begins.",961000000,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,"Captain Barbossa, long believed to be dead, ha...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",5/19/07,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released


### B. Feature Engineer Contents of Nested Instance Columns

In [9]:
# genres
print('Genres: ', movie_df.genres[0])
print('  ')
# keywords
print('Keywords: ', movie_df.keywords[0])
print('  ')
# production_companies
print('Production Companies: ', movie_df.production_companies[0])
print('  ')
# production_countries
print('Production Countries: ', movie_df.production_countries[0])
print('  ')
# spoken_languages
print('Spoken Languages: ', movie_df.spoken_languages[0])

Genres:  [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]
  
Keywords:  [{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]
  
Production Companies:  [{"name": "Ingenious Film Partners", "id": 289}, {"name": "Twentieth Century Fox

#### 1. Organize Nested List of Dictionaries by Movie Title

In [10]:
# Import necessary modules
import ast

In [11]:
# Create function to convert strings to list of dictionaries 
def listdictstr_to_listdictkey(data):
    """Convert Column of Dictionary Strings to Column of Lists of Dictionaries and Return Unique Keys"""
    
    # Define an empty list var
    dict_list = []
    unique_keys = []
    
    # Convert dictionary string to list of lists of dictionaries
    for instance in data:     
        dl = ast.literal_eval(instance)
        dict_list.append(dl)
    
    # Select unique keys
    for lists in dict_list:
        for d in lists:
            for key in d:
                unique_keys.append(key)
    unique_keys = set(unique_keys)
    
    return dict_list, unique_keys

In [12]:
# Create a function that produces a dataframe from the column list of dictionaries
def dictlist_to_dataframe(dict_list, unique_keys):
    """Convert Column of Lists of Dictionaries to one Merged Dataframe"""
    
    # Create column list 
    columns = list(unique_keys)
    frames = []

    for i in range(len(dict_list)):
    
        # Select movie title by matching position in movie_df (title is the first column)
        dlist = dict_list[i]
        movie = movie_df.iloc[i, 0]
    
        # Substitute a placeholder entry when a movie has no values
        if dlist == [ ] or dlist == '' or dlist == [np.nan]:
            content = [{'id': 0, 'name': 'Not Specified'}]
            df = pd.DataFrame(content, columns=columns) 
        else: 
            df = pd.DataFrame(dlist, columns=columns) 
        df['title'] = movie
        frames.append(df)
    
    # Merge movie title instances into one combined dataframe
    # (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
    new_df = pd.concat(frames, sort=False, ignore_index=True)
        
    return new_df

In [13]:
# Genres
genres_list, genres_keys = listdictstr_to_listdictkey(movie_df.genres)
print(len(genres_list))

# Genres Dataframe
genres_df = dictlist_to_dataframe(genres_list, genres_keys)
print(len(np.unique(genres_df['title'])))
genres_df.head()       

4799
4796


Unnamed: 0,id,name,title
0,28,Action,Avatar
1,12,Adventure,Avatar
2,14,Fantasy,Avatar
3,878,Science Fiction,Avatar
4,12,Adventure,Pirates of the Caribbean: At World's End


In [14]:
# Keywords 
keywords_list, keywords_keys = listdictstr_to_listdictkey(movie_df.keywords)
print(len(keywords_list))

# Keywords Dataframe
keywords_df = dictlist_to_dataframe(keywords_list, keywords_keys)
print(len(np.unique(keywords_df['title'])))
keywords_df.head()

4799
4796


Unnamed: 0,id,name,title
0,1463,culture clash,Avatar
1,2964,future,Avatar
2,3386,space war,Avatar
3,3388,space colony,Avatar
4,3679,society,Avatar


In [15]:
# Production Companies
prod_comp_list, prod_comp_keys = listdictstr_to_listdictkey(movie_df.production_companies)
print(len(prod_comp_list))

# Production Companies Dataframe
prod_companies_df = dictlist_to_dataframe(prod_comp_list, prod_comp_keys)
print(len(np.unique(prod_companies_df['title'])))
prod_companies_df.head()

4799
4796


Unnamed: 0,id,name,title
0,289,Ingenious Film Partners,Avatar
1,306,Twentieth Century Fox Film Corporation,Avatar
2,444,Dune Entertainment,Avatar
3,574,Lightstorm Entertainment,Avatar
4,2,Walt Disney Pictures,Pirates of the Caribbean: At World's End


In [16]:
# Production Countries
prod_count_list, prod_count_keys = listdictstr_to_listdictkey(movie_df.production_countries)
print(len(prod_count_list))

# Production Countries Dataframe
prod_countries_df = dictlist_to_dataframe(prod_count_list, prod_count_keys)

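# Rename columns 
# NOTE: this rename is positional, and the column order comes from a set (prod_count_keys),
# so it is not guaranteed; in the output below, 'id' holds the country name and 'name' the ISO code
prod_countries_df.columns = ['id', 'name', 'title']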
print(len(np.unique(prod_countries_df['title'])))
prod_countries_df.head()

4799
4796


Unnamed: 0,id,name,title
0,United States of America,US,Avatar
1,United Kingdom,GB,Avatar
2,United States of America,US,Pirates of the Caribbean: At World's End
3,United Kingdom,GB,Spectre
4,United States of America,US,Spectre


In [17]:
# Spoken Languages
spoken_lang_list, spoken_lang_keys = listdictstr_to_listdictkey(movie_df.spoken_languages)
print(len(spoken_lang_list))

# Spoken Languages Dataframe
spoken_lang_df = dictlist_to_dataframe(spoken_lang_list, spoken_lang_keys)

# Rename columns 
spoken_lang_df.columns = ['id', 'name', 'title']
print(len(np.unique(spoken_lang_df['title'])))
spoken_lang_df.head()

4799
4796


Unnamed: 0,id,name,title
0,en,English,Avatar
1,es,Español,Avatar
2,en,English,Pirates of the Caribbean: At World's End
3,fr,Français,Spectre
4,en,English,Spectre


*Observations*: The **dictlist_to_dataframe** function consistently yields *4796* unique titles from *4799* rows. Because the function writes at least one row for every movie (movies with empty lists receive a 'Not Specified' placeholder), the count is identical for all five columns, and the gap of 3 comes from movies that share a title with another movie rather than from rows lost in processing. Since the discrepancy is relatively small, the project will move forward as-is and will drop any NaNs that result from appending these new values to movie_df.
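
A gap between the row count and the unique-title count can be inspected directly. The sketch below uses a toy stand-in for movie_df (the `toy_movies` frame and its values are made up for illustration) to show how rows that share a title can be listed:

```python
import pandas as pd

# Toy stand-in for movie_df: two different films share one title
toy_movies = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["Film A", "Film B", "Film A", "Film C"],
})

n_rows = len(toy_movies)
n_unique_titles = toy_movies["title"].nunique()

# keep=False marks every row whose title occurs more than once
shared_titles = toy_movies[toy_movies["title"].duplicated(keep=False)]

print(n_rows, n_unique_titles)
print(shared_titles)
```

Running the same `duplicated(keep=False)` filter on movie_df would show which titles account for the difference, and whether the id column would make a safer key than the title.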

#### 2. Create feature variables based on Column trends of each Movie Title 

Replace the unprocessed columns with the new features in a copy of movie_df, movie_df_modified

In [214]:
# Make copy of movie_df
movie_df_modified = movie_df.copy()

# Drop original nested columns, which are replaced by the pd.get_dummies features
movie_df_modified = movie_df_modified.drop(columns=['genres', 'keywords', 'production_companies',
                                  'production_countries', 'spoken_languages'])
movie_df_modified.head(1)

Unnamed: 0,title,tagline,revenue,budget,id,original_language,overview,release_date,runtime,status
0,Avatar,Enter the World of Pandora.,2787965087,237000000,19995,en,"In the 22nd century, a paraplegic Marine is di...",12/10/09,162.0,Released


In [225]:
movie_df_modified.columns

Index(['title', 'tagline', 'revenue', 'budget', 'id', 'original_language',
       'overview', 'release_date', 'runtime', 'status'],
      dtype='object')

In [19]:
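# Create a function that replaces column contents with new (cleaned) values
def dataframe_to_dummies(df, col_prefix):
    """Convert Dataframe to List Instance by Movie Title"""
    
    df_column = []
    
    # Convert dataframe rows to lists of names by title
    for title in df['title'].unique():
        title_df = df.loc[(df['title'] == title)]
        name_list = list(title_df['name'].unique())
        df_column.append(name_list)
        
    # Get Dummies: one row per unique title, one column per name
    # (sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is equivalent)
    df_column = pd.Series(df_column)
    new_df = pd.get_dummies(df_column.apply(pd.Series).stack(), prefix=col_prefix).groupby(level=0).sum()
    
    # NOTE: this assignment aligns on the row index of the long-format df, not on the
    # unique titles, so titles repeat (see the test output below)
    new_df['title'] = df['title']
    
    return new_df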
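
For comparison, the same multi-hot encoding can be sketched without the per-title loop by grouping indicator columns on a unique key. This is an alternative under stated assumptions, not the notebook's method: `toy_genres` is made-up data in the same long format as genres_df, and the `movie_id` column is a hypothetical unique key that the long-format frames above do not carry.

```python
import pandas as pd

# Toy long-format frame shaped like genres_df, plus a hypothetical
# movie_id column used as a unique key
toy_genres = pd.DataFrame({
    "movie_id": [10, 10, 20, 30, 30],
    "title": ["Film A", "Film A", "Film B", "Film A", "Film A"],
    "name": ["Action", "Fantasy", "Action", "Drama", "Action"],
})

# One indicator column per genre, then collapse to one row per movie
indicators = pd.get_dummies(toy_genres["name"], prefix="genre").astype(int)
multi_hot = indicators.groupby(toy_genres["movie_id"]).max()

# Attach one title per movie_id; duplicate titles stay on separate rows
titles = toy_genres.drop_duplicates("movie_id").set_index("movie_id")["title"]
multi_hot.insert(0, "title", titles)
print(multi_hot)
```

Keying on an id rather than the title keeps two different films named "Film A" apart, and the title attached to each row is the one that belongs to that key.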

Apply function to all columns in question

In [20]:
# genres
genres_df_dummies = dataframe_to_dummies(genres_df, 'genre')

# keywords
keywords_df_dummies = dataframe_to_dummies(keywords_df, 'keyword')

# production_companies
prod_companies_df_dummies = dataframe_to_dummies(prod_companies_df, 'prod_company')

# production_countries
prod_countries_df_dummies = dataframe_to_dummies(prod_countries_df, 'prod_country')

# spoken_languages
spoken_lang_df_dummies = dataframe_to_dummies(spoken_lang_df, 'spoken_lang')

In [171]:
# Test outputs
print(genres_df_dummies['title'].head(3))
print(keywords_df_dummies['title'].head(3))
print(prod_companies_df_dummies['title'].head(3))
print(prod_countries_df_dummies['title'].head(3))
print(spoken_lang_df_dummies['title'].head(3))

0    Avatar
1    Avatar
2    Avatar
Name: title, dtype: object
0    Avatar
1    Avatar
2    Avatar
Name: title, dtype: object
0    Avatar
1    Avatar
2    Avatar
Name: title, dtype: object
0                                      Avatar
1                                      Avatar
2    Pirates of the Caribbean: At World's End
Name: title, dtype: object
0                                      Avatar
1                                      Avatar
2    Pirates of the Caribbean: At World's End
Name: title, dtype: object


Merge the pd.get_dummies dataframes with movie_df_modified

In [186]:
# Define list of all dataframes to be merged together
dfs = [movie_df_modified, genres_df_dummies, keywords_df_dummies, 
       prod_companies_df_dummies, prod_countries_df_dummies, spoken_lang_df_dummies]

In [205]:
# Concat movie_df with dummies dataframes
df_dummies = pd.concat(dfs, sort=False, join='outer', axis=0).groupby('title').agg('sum')

In [211]:
print(df_dummies.shape)
df_dummies.head()

(4796, 14997)


Unnamed: 0_level_0,genre_Action,genre_Adventure,genre_Animation,genre_Comedy,genre_Crime,genre_Documentary,genre_Drama,genre_Family,genre_Fantasy,genre_Foreign,...,spoken_lang_বাংলা,spoken_lang_ਪੰਜਾਬੀ,spoken_lang_தமிழ்,spoken_lang_తెలుగు,spoken_lang_ภาษาไทย,spoken_lang_ქართული,spoken_lang_广州话 / 廣州話,spoken_lang_日本語,spoken_lang_普通话,spoken_lang_한국어/조선말
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
#Horror,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
(500) Days of Summer,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10 Cloverfield Lane,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10 Days in a Madhouse,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10 Things I Hate About You,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [207]:
# Drop revenue, budget, id, and runtime
df_dummies = df_dummies.drop(columns=['revenue', 'budget', 'id', 'runtime'])

In [229]:
# Sort movie_df_modified by title to match the dummies dataframe title order
movie_revenue = movie_df_modified[['title', 'tagline', 'revenue', 'budget', 'id', 'original_language',
       'overview', 'release_date', 'runtime', 'status']].sort_values(by='title')
print(movie_revenue.shape)
movie_revenue.head()

(4799, 10)


Unnamed: 0,title,tagline,revenue,budget,id,original_language,overview,release_date,runtime,status
4257,#Horror,Death is trending.,0,1500000,301325,de,"Inspired by actual events, a group of 12 year ...",11/20/15,90.0,Released
3339,(500) Days of Summer,It was almost like falling in love.,60722734,7500000,19913,en,"Tom (Joseph Gordon-Levitt), greeting-card writ...",7/17/09,95.0,Released
3556,10 Cloverfield Lane,Monsters come in many forms.,108286421,15000000,333371,en,"After a car accident, Michelle awakens to find...",3/10/16,103.0,Released
2903,10 Days in a Madhouse,,0,1200000,345003,en,"Nellie Bly, a 23 year-old reporter for Joseph ...",11/20/15,111.0,Released
2739,10 Things I Hate About You,How do I loathe thee? Let me count the ways.,53478166,16000000,4951,en,"Bianca, a tenth grader, has never gone on a da...",3/30/99,97.0,Released


In [232]:
# Merge df_dummies dataframe with movie_df_modified
movie_dummies_df = pd.merge(movie_revenue, df_dummies, how='inner', left_on='title', right_on='title')
print(movie_dummies_df.shape)
movie_dummies_df.head()

(4799, 15007)


Unnamed: 0,title,tagline,revenue,budget,id,original_language,overview,release_date,runtime,status,...,spoken_lang_বাংলা,spoken_lang_ਪੰਜਾਬੀ,spoken_lang_தமிழ்,spoken_lang_తెలుగు,spoken_lang_ภาษาไทย,spoken_lang_ქართული,spoken_lang_广州话 / 廣州話,spoken_lang_日本語,spoken_lang_普通话,spoken_lang_한국어/조선말
0,#Horror,Death is trending.,0,1500000,301325,de,"Inspired by actual events, a group of 12 year ...",11/20/15,90.0,Released,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,(500) Days of Summer,It was almost like falling in love.,60722734,7500000,19913,en,"Tom (Joseph Gordon-Levitt), greeting-card writ...",7/17/09,95.0,Released,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,10 Cloverfield Lane,Monsters come in many forms.,108286421,15000000,333371,en,"After a car accident, Michelle awakens to find...",3/10/16,103.0,Released,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,10 Days in a Madhouse,,0,1200000,345003,en,"Nellie Bly, a 23 year-old reporter for Joseph ...",11/20/15,111.0,Released,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,10 Things I Hate About You,How do I loathe thee? Let me count the ways.,53478166,16000000,4951,en,"Bianca, a tenth grader, has never gone on a da...",3/30/99,97.0,Released,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


*Observations*: The overall shape of the new dataset seems correct, since the number of rows remained at 4796, matching the number of unique titles in the dummies dataframes. This was accomplished by concatenating the dataframes and then grouping by title. However, the tagline column appears to be filled in with numerical values; since this column had multiple NaNs and has not been addressed yet, this numerical column holds no value.

In addition, since the categorical columns have been processed into one-hot vectors, there is no need for the original columns. Thus, these columns along with the current tagline column need to be dropped, and the original (non-numerical) tagline column needs to be added to the new dummies dataset.

### C. Apply Random Forest for *tagline* column predictions.

In order to train this model to predict NaN entries, use the rows with taglines as the training data and the rows with NaN taglines as the testing data.

In [234]:
# Import necessary modules
from sklearn.ensemble import RandomForestRegressor

In [236]:
# Separate rows with NaN from rows with taglines
train_with_tagline = movie_dummies_df[movie_dummies_df['tagline'].notnull()]
test_with_nan = movie_dummies_df[movie_dummies_df['tagline'].isnull()]

In [237]:
train_with_tagline.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3959 entries, 0 to 4798
Columns: 15007 entries, title to spoken_lang_한국어/조선말
dtypes: float64(14998), int64(3), object(6)
memory usage: 453.3+ MB


#### Encode remaining categorical variables

In [238]:
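# Convert Column value strings to a numeric value
def string_to_numeric(data):
    """Fill NaNs in object columns with the column mode, then encode them as category codes"""
    for column in data.select_dtypes(include="object").columns:
        mode = data[column].mode()
        # mode() returns a Series, so fill with its first value; an all-NaN column has no mode
        if not mode.empty:
            data[column] = data[column].fillna(mode.iloc[0])
        # Any NaNs that remain are encoded as -1
        data[column] = data[column].astype("category").cat.codes
    return data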
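
Because `string_to_numeric` is applied to the training and testing frames separately, each call derives its own category codes, so the same code can stand for different strings in the two frames (in the outputs below, the first row of each frame has title code 0 although they are different movies). A minimal sketch of one way to keep the codes consistent, assuming a shared category list built from both frames; the toy frames and variable names are illustrative only:

```python
import pandas as pd

# Toy train/test split with an overlapping categorical column
train = pd.DataFrame({"original_language": ["en", "fr", "en"]})
test = pd.DataFrame({"original_language": ["fr", "de"]})

# Build the category list once from both frames so codes agree across them
categories = sorted(set(train["original_language"]) | set(test["original_language"]))
dtype = pd.CategoricalDtype(categories=categories)

train_codes = train["original_language"].astype(dtype).cat.codes
test_codes = test["original_language"].astype(dtype).cat.codes

print(dict(enumerate(categories)))
print(list(train_codes), list(test_codes))
```

With a shared `CategoricalDtype`, "fr" maps to the same integer in both frames, which a model trained on one frame and applied to the other relies on.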

In [239]:
# Training data
string_to_numeric(train_with_tagline)
train_with_tagline.head(2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Unnamed: 0,title,tagline,revenue,budget,id,original_language,overview,release_date,runtime,status,...,spoken_lang_বাংলা,spoken_lang_ਪੰਜਾਬੀ,spoken_lang_தமிழ்,spoken_lang_తెలుగు,spoken_lang_ภาษาไทย,spoken_lang_ქართული,spoken_lang_广州话 / 廣州話,spoken_lang_日本語,spoken_lang_普通话,spoken_lang_한국어/조선말
0,0,645,0,1500000,301325,5,2191,544,90.0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,1569,60722734,7500000,19913,7,3443,2174,95.0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [240]:
# Testing Data
string_to_numeric(test_with_nan)
test_with_nan.head(2)

Unnamed: 0,title,tagline,revenue,budget,id,original_language,overview,release_date,runtime,status,...,spoken_lang_বাংলা,spoken_lang_ਪੰਜਾਬੀ,spoken_lang_தமிழ்,spoken_lang_తెలుగు,spoken_lang_ภาษาไทย,spoken_lang_ქართული,spoken_lang_广州话 / 廣州話,spoken_lang_日本語,spoken_lang_普通话,spoken_lang_한국어/조선말
3,0,-1,0,1200000,345003,5,531,186,111.0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17,1,-1,0,0,91122,5,578,688,118.0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Start Prediction with Random Forest Regressor

In [241]:
# Define independent and dependent variables in dataset

# Train
X_train = train_with_tagline.drop('tagline', axis=1)
y_train = train_with_tagline['tagline']

# Test
X_test = test_with_nan.drop('tagline', axis=1)
y_test = test_with_nan['tagline']

In [242]:
# Create a RFR model instance
rfr_tagline = RandomForestRegressor()

# Fit to model
rfr_tagline.fit(X_train, y_train)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [243]:
# Generate predicted tagline values
generated_taglines = rfr_tagline.predict(X_test)

Cast the generated tagline values to int, and replace the missing tagline values with the data predicted by the model

In [244]:
# Replace column contents
test_with_nan.loc[:, 'tagline'] = generated_taglines.astype(int)

# Create new movie dataframe with generated taglines
# (DataFrame.append was removed in pandas 2.0; pd.concat gives the same result)
movie_generated_taglines = pd.concat([train_with_tagline, test_with_nan])
movie_generated_taglines.head(2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Unnamed: 0,title,tagline,revenue,budget,id,original_language,overview,release_date,runtime,status,...,spoken_lang_বাংলা,spoken_lang_ਪੰਜਾਬੀ,spoken_lang_தமிழ்,spoken_lang_తెలుగు,spoken_lang_ภาษาไทย,spoken_lang_ქართული,spoken_lang_广州话 / 廣州話,spoken_lang_日本語,spoken_lang_普通话,spoken_lang_한국어/조선말
0,0,645,0,1500000,301325,5,2191,544,90.0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,1569,60722734,7500000,19913,7,3443,2174,95.0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [245]:
# Reset index
movie_generated_taglines.reset_index(inplace=True)

In [246]:
# Drop index column
movie_generated_taglines.drop('index',inplace=True,axis=1)
movie_generated_taglines.head(2)

Unnamed: 0,title,tagline,revenue,budget,id,original_language,overview,release_date,runtime,status,...,spoken_lang_বাংলা,spoken_lang_ਪੰਜਾਬੀ,spoken_lang_தமிழ்,spoken_lang_తెలుగు,spoken_lang_ภาษาไทย,spoken_lang_ქართული,spoken_lang_广州话 / 廣州話,spoken_lang_日本語,spoken_lang_普通话,spoken_lang_한국어/조선말
0,0,645,0,1500000,301325,5,2191,544,90.0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,1569,60722734,7500000,19913,7,3443,2174,95.0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [247]:
# Store cleaned dataframe
movies_clean = movie_generated_taglines.copy()
%store movies_clean

Stored 'movies_clean' (DataFrame)
