<a href="https://colab.research.google.com/drive/1RryiGSfzqvGaCrBZ0EAujdLa5KZfRcB2?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This tutorial needs data. If you are working in Colab, follow the data setup instructions below.

# Data Setup Instructions

These instructions explain how to mount your Google Drive in Colab and access the data from it.

STEP 1 - After opening the tutorial in Colab, click the folder button in the sidebar and then click the button to mount Google Drive.

STEP 2 - The drive folder is mounted under the current directory, /content. You can confirm this as shown below.

In [1]:
# print current directory
%pwd

'/content'

In [2]:
%ls

[0m[01;34mdrive[0m/  [01;34msample_data[0m/


STEP 3 - Find the folder where you saved the data and symlink it into the /content folder to simplify data access.

In this example the Data folder is located at the following path in Google Drive (use your own data path):

/content/drive/Othercomputers/My MacBook Pro/Data/

We can symlink it into the /content folder using the following command:

In [3]:
# symlink the original data folder into /content
!ln -s "/content/drive/Othercomputers/My MacBook Pro/Data" "/content"

Now we can access the data through this link by giving file paths that start with Data/.
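Before loading any files, it can help to check that a link really resolves to the data folder. The sketch below is only an illustration: it builds a throwaway folder layout in a temporary directory (the names `workdir`, `source`, `link` and `example.csv` are hypothetical stand-ins, not part of this tutorial's data) and creates the link from Python instead of with `ln -s`.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the Drive folder and for /content.
workdir = Path(tempfile.mkdtemp())
source = workdir / "drive" / "Data"
source.mkdir(parents=True)
(source / "example.csv").write_text("a,b\n1,2\n")

# Create the symlink only if nothing exists at the target location yet.
link = workdir / "Data"
if not link.exists():
    os.symlink(source, link, target_is_directory=True)

# The link should be reported as a symlink and expose the same files.
print(link.is_symlink())
print(sorted(p.name for p in link.iterdir()))
```

In Colab the same check would be run against your own link under /content rather than a temporary directory.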

# Importing pandas library and data loading

In [4]:
import pandas as pd

In this lesson we do not use the original movies dataframe. Instead we use the one we created in the last lesson, which has no missing worldwide_gross_income values.

We saved that dataframe at the end of the last lesson in the file

'movies_cleaned_lesson3.csv'.

You can read this file as shown below (the path is the same as in the previous lesson).

In [5]:
# if you are working through this tutorial on a local machine, use the file path where the data is saved on your computer
movies_cleaned_gross_income = pd.read_csv("Data/IMDB_rotten_tomato_dataset/IMDB/cleaned_files/movies_cleaned_lesson3.csv")
# .head() shows the first 5 rows of the dataset
movies_cleaned_gross_income.head()

Unnamed: 0,imdb_title_id,original_title,year,date_published,genre,duration,country,language,imdb_score,votes,budget,usa_gross_income,worldwide_gross_income,metascore,movie_age
0,tt0007183,Pikovaya dama,1916,1916-04-01,"Drama, Fantasy, Horror",63,Russia,Russian,7.0,610,,,$ 144968,,105
1,tt0010323,Das Cabinet des Dr. Caligari,1920,1920-02-27,"Fantasy, Horror, Mystery",76,Germany,German,8.1,55601,$ 18000,$ 8811,$ 8811,,101
2,tt0011440,Markens grøde,1921,1921-12-02,Drama,107,Norway,,6.6,195,NOK 250000,,$ 4272,,100
3,tt0011741,Suds,1920,1920-01-27,"Comedy, Drama, Romance",75,USA,English,6.3,210,,,$ 772155,,101
4,tt0012190,The Four Horsemen of the Apocalypse,1921,1923-04-16,"Drama, Romance, War",150,USA,,7.2,3058,$ 800000,$ 9183673,$ 9183673,,100


### Iterative nature of Preprocessing and cleaning methods

Preprocessing and cleaning in a real-world problem is rarely straightforward; it is an iterative process. To demonstrate this, we will try to change the datatype of 'worldwide_gross_income'.

Currently worldwide_gross_income is an object column. Since its values are income figures, a numeric datatype would suit it better.

In [6]:
movies_cleaned_gross_income.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31016 entries, 0 to 31015
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   imdb_title_id           31016 non-null  object 
 1   original_title          31016 non-null  object 
 2   year                    31016 non-null  int64  
 3   date_published          31016 non-null  object 
 4   genre                   31016 non-null  object 
 5   duration                31016 non-null  int64  
 6   country                 31001 non-null  object 
 7   language                30862 non-null  object 
 8   imdb_score              31016 non-null  float64
 9   votes                   31016 non-null  int64  
 10  budget                  12762 non-null  object 
 11  usa_gross_income        14168 non-null  object 
 12  worldwide_gross_income  31016 non-null  object 
 13  metascore               11317 non-null  float64
 14  movie_age               31016 non-null

In [7]:
movies_cleaned_gross_income['worldwide_gross_income']

0         $ 144968
1           $ 8811
2           $ 4272
3         $ 772155
4        $ 9183673
           ...    
31011       $ 4791
31012    $ 3507171
31013    $ 7299062
31014       $ 2833
31015      $ 59794
Name: worldwide_gross_income, Length: 31016, dtype: object

We cannot convert this column to a numeric datatype while a dollar sign is attached to every value.

If we remove the dollar sign, we should be able to change the column's datatype.

To remove the dollar sign, we can use the apply function to create a new column from worldwide_gross_income without the dollar sign, as shown below.

In [8]:
def remove_dollar_sign(gross_income):
    return gross_income.replace('$','')

If a statement is too long to fit comfortably on one line in Colab, we can end the line with a \ character to continue the statement on the next line.

The command below does this while creating the new column 'worldwide_gross_income_numbered'.

In [9]:
movies_cleaned_gross_income['worldwide_gross_income_numbered'] = movies_cleaned_gross_income.apply(lambda \
                                                    row: remove_dollar_sign(row['worldwide_gross_income']),axis=1)

movies_cleaned_gross_income

Unnamed: 0,imdb_title_id,original_title,year,date_published,genre,duration,country,language,imdb_score,votes,budget,usa_gross_income,worldwide_gross_income,metascore,movie_age,worldwide_gross_income_numbered
0,tt0007183,Pikovaya dama,1916,1916-04-01,"Drama, Fantasy, Horror",63,Russia,Russian,7.0,610,,,$ 144968,,105,144968
1,tt0010323,Das Cabinet des Dr. Caligari,1920,1920-02-27,"Fantasy, Horror, Mystery",76,Germany,German,8.1,55601,$ 18000,$ 8811,$ 8811,,101,8811
2,tt0011440,Markens grøde,1921,1921-12-02,Drama,107,Norway,,6.6,195,NOK 250000,,$ 4272,,100,4272
3,tt0011741,Suds,1920,1920-01-27,"Comedy, Drama, Romance",75,USA,English,6.3,210,,,$ 772155,,101,772155
4,tt0012190,The Four Horsemen of the Apocalypse,1921,1923-04-16,"Drama, Romance, War",150,USA,,7.2,3058,$ 800000,$ 9183673,$ 9183673,,100,9183673
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31011,tt9905412,Ottam,2019,2019-03-08,Drama,120,India,Malayalam,7.4,494,INR 4000000,,$ 4791,,2,4791
31012,tt9908390,Le lion,2020,2020-01-29,Comedy,95,"France, Belgium",French,5.3,398,,,$ 3507171,,1,3507171
31013,tt9911196,De Beentjes van Sint-Hildegard,2020,2020-02-13,"Comedy, Drama",103,Netherlands,"German, Dutch",7.7,724,,,$ 7299062,,1,7299062
31014,tt9914286,Sokagin Çocuklari,2019,2019-03-15,"Drama, Family",98,Turkey,Turkish,6.4,194,,,$ 2833,,2,2833
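As an aside, the same cleanup can be sketched without apply, using pandas' vectorized string methods. The toy dataframe `sample` below is a hypothetical stand-in with values in the same "$ amount" format as the column above. Wrapping the expression in parentheses is an alternative to the backslash for spreading a statement over several lines.

```python
import pandas as pd

# Toy data in the same "$ <amount>" format (illustrative values only).
sample = pd.DataFrame({"worldwide_gross_income": ["$ 144968", "$ 8811", "$ 4272"]})

# regex=False treats '$' as a literal character, not as the regex end-of-string anchor.
# .str.strip() then removes the space that was left after the dollar sign.
sample["worldwide_gross_income_numbered"] = (
    sample["worldwide_gross_income"]
    .str.replace("$", "", regex=False)
    .str.strip()
)

print(sample["worldwide_gross_income_numbered"].tolist())
```

The values are still strings at this point; only the dollar sign and the surrounding whitespace are gone.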


Now we can try to convert this column to a numeric datatype.

In [10]:
# you will get an error here
movies_cleaned_gross_income['worldwide_gross_income_numbered'] = movies_cleaned_gross_income[\
                                                              'worldwide_gross_income_numbered'].astype(int)


ValueError: ignored

The error above occurs because some values still carry currency symbols other than the dollar sign.

So our initial assumption that this column contains only dollar amounts was wrong. We first need to filter out the rows whose gross income is in a currency other than dollars.

Another reason to remove those rows is that gross income values in different currencies cannot be compared directly.

**The above problem shows that when we are doing preprocessing we may need to work on it in an iterative fashion.**
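One way to find the values that block such a conversion is `pd.to_numeric` with `errors="coerce"`, which turns anything it cannot parse into NaN instead of raising an error. The sketch below uses a small hypothetical series `raw`; the entry "INR 4000000" is an invented stand-in for a non-dollar value and is not taken from the dataset.

```python
import pandas as pd

# Toy values; the last one is a hypothetical non-dollar entry.
raw = pd.Series(["$ 144968", "$ 8811", "INR 4000000"])

# Remove the dollar sign and surrounding whitespace, as in the cleanup step.
stripped = raw.str.replace("$", "", regex=False).str.strip()

# Values that cannot be parsed become NaN instead of raising ValueError.
converted = pd.to_numeric(stripped, errors="coerce")

# Show the original strings that failed to convert.
bad_values = raw[converted.isna()]
print(bad_values.tolist())
```

Inspecting the failing values first makes it easier to decide whether to drop those rows or handle them some other way.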

We will first drop the 'worldwide_gross_income_numbered' column

In [11]:
movies_cleaned_gross_income.drop(['worldwide_gross_income_numbered'],axis=1,inplace=True)

We will filter out the rows that do not have a \$ sign in the worldwide_gross_income column.

The .str.startswith method lets us keep only the rows whose worldwide_gross_income starts with a $ sign.

We also create a new dataframe from this filtered dataset.

In [12]:
movies_cleaned_gross_income_dol =  movies_cleaned_gross_income.loc[\
                                movies_cleaned_gross_income['worldwide_gross_income'].str.startswith('$')].copy()

Let's compare the shapes of the two dataframes. The newly created dataframe has 61 fewer rows than the original one.

In [13]:
print(movies_cleaned_gross_income_dol.shape)
print(movies_cleaned_gross_income.shape)

(30955, 15)
(31016, 15)
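To see what a filter like this leaves out, the boolean mask can be inverted and the leading currency codes counted. This is a minimal sketch on a hypothetical toy dataframe `sample`; the non-dollar values are invented for illustration.

```python
import pandas as pd

# Toy data; the INR and NOK rows are hypothetical non-dollar entries.
sample = pd.DataFrame(
    {"worldwide_gross_income": ["$ 144968", "INR 4000000", "$ 8811", "NOK 250000"]}
)

# True where the value starts with a dollar sign.
is_dollar = sample["worldwide_gross_income"].str.startswith("$")

# Number of rows the filter would drop.
n_dropped = int((~is_dollar).sum())
print(n_dropped)

# First whitespace-separated token of each dropped value, i.e. its currency code.
prefixes = sample.loc[~is_dollar, "worldwide_gross_income"].str.split().str[0]
print(prefixes.value_counts().to_dict())
```

Running the same two steps on the real column would show which currencies the dropped rows use.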


Now every value of worldwide_gross_income in the new dataframe starts with a dollar sign, so we can apply the previous method: create a new column and then convert it to a numeric datatype.

In [14]:
movies_cleaned_gross_income_dol['worldwide_gross_income_numbered'] = movies_cleaned_gross_income_dol.apply(lambda \
                                                    row: remove_dollar_sign(row['worldwide_gross_income']),axis=1)

movies_cleaned_gross_income_dol

Unnamed: 0,imdb_title_id,original_title,year,date_published,genre,duration,country,language,imdb_score,votes,budget,usa_gross_income,worldwide_gross_income,metascore,movie_age,worldwide_gross_income_numbered
0,tt0007183,Pikovaya dama,1916,1916-04-01,"Drama, Fantasy, Horror",63,Russia,Russian,7.0,610,,,$ 144968,,105,144968
1,tt0010323,Das Cabinet des Dr. Caligari,1920,1920-02-27,"Fantasy, Horror, Mystery",76,Germany,German,8.1,55601,$ 18000,$ 8811,$ 8811,,101,8811
2,tt0011440,Markens grøde,1921,1921-12-02,Drama,107,Norway,,6.6,195,NOK 250000,,$ 4272,,100,4272
3,tt0011741,Suds,1920,1920-01-27,"Comedy, Drama, Romance",75,USA,English,6.3,210,,,$ 772155,,101,772155
4,tt0012190,The Four Horsemen of the Apocalypse,1921,1923-04-16,"Drama, Romance, War",150,USA,,7.2,3058,$ 800000,$ 9183673,$ 9183673,,100,9183673
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31011,tt9905412,Ottam,2019,2019-03-08,Drama,120,India,Malayalam,7.4,494,INR 4000000,,$ 4791,,2,4791
31012,tt9908390,Le lion,2020,2020-01-29,Comedy,95,"France, Belgium",French,5.3,398,,,$ 3507171,,1,3507171
31013,tt9911196,De Beentjes van Sint-Hildegard,2020,2020-02-13,"Comedy, Drama",103,Netherlands,"German, Dutch",7.7,724,,,$ 7299062,,1,7299062
31014,tt9914286,Sokagin Çocuklari,2019,2019-03-15,"Drama, Family",98,Turkey,Turkish,6.4,194,,,$ 2833,,2,2833


In [15]:
movies_cleaned_gross_income_dol['worldwide_gross_income_numbered'] = movies_cleaned_gross_income_dol[\
                                                              'worldwide_gross_income_numbered'].astype(int)


In [16]:
movies_cleaned_gross_income_dol.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30955 entries, 0 to 31015
Data columns (total 16 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   imdb_title_id                    30955 non-null  object 
 1   original_title                   30955 non-null  object 
 2   year                             30955 non-null  int64  
 3   date_published                   30955 non-null  object 
 4   genre                            30955 non-null  object 
 5   duration                         30955 non-null  int64  
 6   country                          30940 non-null  object 
 7   language                         30801 non-null  object 
 8   imdb_score                       30955 non-null  float64
 9   votes                            30955 non-null  int64  
 10  budget                           12762 non-null  object 
 11  usa_gross_income                 14166 non-null  object 
 12  worldwide_gross_in

We will now drop the original worldwide_gross_income column.

We will also rename the column worldwide_gross_income_numbered to worldwide_gross_income.

In [17]:
movies_cleaned_gross_income_dol.drop(['worldwide_gross_income'],axis=1,inplace=True)
movies_cleaned_gross_income_dol.rename({'worldwide_gross_income_numbered':'worldwide_gross_income'},axis=1,inplace=True)

In [18]:
#let's see the head of the dataframe
movies_cleaned_gross_income_dol.head()

Unnamed: 0,imdb_title_id,original_title,year,date_published,genre,duration,country,language,imdb_score,votes,budget,usa_gross_income,metascore,movie_age,worldwide_gross_income
0,tt0007183,Pikovaya dama,1916,1916-04-01,"Drama, Fantasy, Horror",63,Russia,Russian,7.0,610,,,,105,144968
1,tt0010323,Das Cabinet des Dr. Caligari,1920,1920-02-27,"Fantasy, Horror, Mystery",76,Germany,German,8.1,55601,$ 18000,$ 8811,,101,8811
2,tt0011440,Markens grøde,1921,1921-12-02,Drama,107,Norway,,6.6,195,NOK 250000,,,100,4272
3,tt0011741,Suds,1920,1920-01-27,"Comedy, Drama, Romance",75,USA,English,6.3,210,,,,101,772155
4,tt0012190,The Four Horsemen of the Apocalypse,1921,1923-04-16,"Drama, Romance, War",150,USA,,7.2,3058,$ 800000,$ 9183673,,100,9183673


# Saving the cleaned data

In this lesson we have created the dataframe movies_cleaned_gross_income_dol from our previous lesson's dataframe movies_cleaned_gross_income.

We will save the dataframe 'movies_cleaned_gross_income_dol' in a new file called 

'movies_cleaned_lesson4.csv'

in the folder cleaned_files.

The final path of the saved file is 'Data/IMDB_rotten_tomato_dataset/IMDB/cleaned_files/movies_cleaned_lesson4.csv'

In [19]:
# don't forget to pass index=False when saving the dataframe to a csv file
movies_cleaned_gross_income_dol.to_csv('Data/IMDB_rotten_tomato_dataset/IMDB/cleaned_files/movies_cleaned_lesson4.csv', index = False)
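The effect of `index=False` can be checked with a small round trip. The sketch below writes a hypothetical two-row dataframe `sample` to a temporary file (the path is a throwaway, not the tutorial's data folder) and reads it back, once without and once with the index.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy dataframe with a numeric income column.
sample = pd.DataFrame(
    {"original_title": ["Suds", "Le lion"], "worldwide_gross_income": [772155, 3507171]}
)

# Throwaway file path used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.csv")

# Saved without the index: the columns come back unchanged.
sample.to_csv(path, index=False)
reloaded = pd.read_csv(path)
print(reloaded.columns.tolist())
print(reloaded["worldwide_gross_income"].dtype)

# Saved with the default index=True: an extra unnamed column appears on reload.
sample.to_csv(path)
with_index = pd.read_csv(path)
print(with_index.columns.tolist())
```

The round trip also confirms that the income column is read back as an integer column, so the datatype conversion does not have to be repeated in the next lesson.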

We will use this file in the next chapter on Advanced Data Analysis.