# Data Wrangling - (Movies Dataset)
## by Tamara Gray

## Introduction

### Movies Dataset Description

The movies dataset is taken through a data wrangling process to prepare it for further analysis and to support accurate insights. The dataset contains information about various movies across multiple genres.

### Data Cleaning Process

To get the dataset clean for analysis, the following steps are carried out:
1. Convert all column names to lowercase
2. Delete the irrelevant columns
3. Remove the special characters from the column names and rename the columns
4. Format the release year to show only the year of the film and change the datatype
5. Standardize the country column to one spelling per country
6. Remove the dollar sign, replace letters with digits in income, remove commas, drop null values and convert income to an integer
7. Deal with missing values, inconsistent values and NaN values, and change the duration datatype to an integer
8. Keep just one genre for each movie
9. Modify the score column to a standard format and convert it to a float
10. Modify the votes column and convert it to an integer
11. Replace null values in content_rating with 'Unrated'

In [485]:
## Import relevant libraries and file
import pandas as pd
movies = pd.read_csv('pythonassignment.csv')

In [486]:
movies

Unnamed: 0.1,Unnamed: 0,IMBD title ID,Original titlÊ,Release year,Genrë¨,Duration,Country,Content Rating,Director,Unnamed: 8,Income,Votes,Score
0,0,tt0111161,The Shawshank Redemption,1995-02-10,Drama,142,USA,R,Frank Darabont,,$ 28815245,2.278.845,9.3
1,1,tt0068646,The Godfather,09 21 1972,"Crime, Drama",175,USA,R,Francis Ford Coppola,,$ 246120974,1.572.674,9.2
2,2,tt0468569,The Dark Knight,23 -07-2008,"Action, Crime, Drama",152,US,PG-13,Christopher Nolan,,$ 1005455211,2.241.615,9.
3,3,tt0071562,The Godfather: Part II,1975-09-25,"Crime, Drama",220,USA,R,Francis Ford Coppola,,"$ 4o8,035,783",1.098.714,9.0
4,4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,,$ 222831817,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,96,tt0070735,The Sting,1974-03-21,"Comedy, Crime, Drama",129,USA,PG,George Roy Hill,,$ 156000000,236.285,7.5
97,97,tt0082096,Das Boot,1982-03-18,"Adventure, Drama, Thriller",149,West Germany,R,Wolfgang Petersen,,$ 11487676,226.427,7.5
98,98,tt0059578,Per qualche dollaro in piÃ¹,1965-12-20,Western,132,Italy,,Sergio Leone,,$ 15000000,226.039,7.4
99,99,tt1832382,Jodaeiye Nader az Simin,2011-10-21,Drama,123,Iran,PG-13,Asghar Farhadi,,$ 22926076,214.165,7.4


In [488]:
## reviewing the information of the columns
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      101 non-null    int64  
 1   IMBD title ID   100 non-null    object 
 2   Original titlÊ  100 non-null    object 
 3   Release year    100 non-null    object 
 4   Genrë¨          100 non-null    object 
 5   Duration        99 non-null     object 
 6   Country         100 non-null    object 
 7   Content Rating  77 non-null     object 
 8   Director        100 non-null    object 
 9   Unnamed: 8      0 non-null      float64
 10  Income          100 non-null    object 
 11   Votes          100 non-null    object 
 12  Score           100 non-null    object 
dtypes: float64(1), int64(1), object(11)
memory usage: 10.4+ KB


In [489]:
## Convert all column names to lowercase
movies.columns = movies.columns.str.lower()

In [490]:
movies

Unnamed: 0,unnamed: 0,imbd title id,original titlê,release year,genrë¨,duration,country,content rating,director,unnamed: 8,income,votes,score
0,0,tt0111161,The Shawshank Redemption,1995-02-10,Drama,142,USA,R,Frank Darabont,,$ 28815245,2.278.845,9.3
1,1,tt0068646,The Godfather,09 21 1972,"Crime, Drama",175,USA,R,Francis Ford Coppola,,$ 246120974,1.572.674,9.2
2,2,tt0468569,The Dark Knight,23 -07-2008,"Action, Crime, Drama",152,US,PG-13,Christopher Nolan,,$ 1005455211,2.241.615,9.
3,3,tt0071562,The Godfather: Part II,1975-09-25,"Crime, Drama",220,USA,R,Francis Ford Coppola,,"$ 4o8,035,783",1.098.714,9.0
4,4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,,$ 222831817,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,96,tt0070735,The Sting,1974-03-21,"Comedy, Crime, Drama",129,USA,PG,George Roy Hill,,$ 156000000,236.285,7.5
97,97,tt0082096,Das Boot,1982-03-18,"Adventure, Drama, Thriller",149,West Germany,R,Wolfgang Petersen,,$ 11487676,226.427,7.5
98,98,tt0059578,Per qualche dollaro in piÃ¹,1965-12-20,Western,132,Italy,,Sergio Leone,,$ 15000000,226.039,7.4
99,99,tt1832382,Jodaeiye Nader az Simin,2011-10-21,Drama,123,Iran,PG-13,Asghar Farhadi,,$ 22926076,214.165,7.4


In [491]:
## Delete the irrelevant columns
movies.drop(['unnamed: 0','imbd title id', 'unnamed: 8'],axis = 1, inplace = True)
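
The `info()` output for the raw file lists one column with no values at all (`Unnamed: 8`) and 101 rows where most columns hold only 100 values, which suggests one largely empty row. As a hedged alternative to naming empty columns by hand, the sketch below drops fully empty columns and rows with `dropna`. The small frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the raw table: one empty column, one empty row.
df = pd.DataFrame({
    "title": ["A", None, "B"],
    "unnamed: 8": [None, None, None],
    "score": ["9.3", None, "7.4"],
})

# Drop columns that contain no values, then rows that contain no values.
df = df.dropna(axis=1, how="all").dropna(axis=0, how="all")

print(df)
```

Columns that are irrelevant but not empty, such as an ID column, still have to be dropped by name.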

In [492]:
movies

Unnamed: 0,original titlê,release year,genrë¨,duration,country,content rating,director,income,votes,score
0,The Shawshank Redemption,1995-02-10,Drama,142,USA,R,Frank Darabont,$ 28815245,2.278.845,9.3
1,The Godfather,09 21 1972,"Crime, Drama",175,USA,R,Francis Ford Coppola,$ 246120974,1.572.674,9.2
2,The Dark Knight,23 -07-2008,"Action, Crime, Drama",152,US,PG-13,Christopher Nolan,$ 1005455211,2.241.615,9.
3,The Godfather: Part II,1975-09-25,"Crime, Drama",220,USA,R,Francis Ford Coppola,"$ 4o8,035,783",1.098.714,9.0
4,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,$ 222831817,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...
96,The Sting,1974-03-21,"Comedy, Crime, Drama",129,USA,PG,George Roy Hill,$ 156000000,236.285,7.5
97,Das Boot,1982-03-18,"Adventure, Drama, Thriller",149,West Germany,R,Wolfgang Petersen,$ 11487676,226.427,7.5
98,Per qualche dollaro in piÃ¹,1965-12-20,Western,132,Italy,,Sergio Leone,$ 15000000,226.039,7.4
99,Jodaeiye Nader az Simin,2011-10-21,Drama,123,Iran,PG-13,Asghar Farhadi,$ 22926076,214.165,7.4


In [493]:
## Remove the special characters from the column names and rename the columns
movies.rename(columns = {'original titlê':'original_title',
                         'genrë¨':'genre',
                         'release year':'release_year',
                         'content rating':'content_rating'}, inplace = True)
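
The lowercase and rename steps can also be combined into one helper. This is a minimal sketch, assuming the stray characters are accented letters and diacritic marks that Unicode NFKD normalization can decompose; `normalize_column` is a hypothetical name, and the sample column names are copied from the raw header.

```python
import re
import unicodedata

import pandas as pd


def normalize_column(name: str) -> str:
    """Return an ASCII, lowercase, underscore-separated column name."""
    # Decompose accented characters, then discard anything outside ASCII.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Trim surrounding spaces, lowercase, and join words with underscores.
    return re.sub(r"\s+", "_", ascii_name.strip().lower())


df = pd.DataFrame(columns=["Original titlÊ", "Genrë¨", "Release year", " Votes ", "Content Rating"])
df.columns = [normalize_column(c) for c in df.columns]

print(list(df.columns))
```

A helper like this also trims the spaces around names such as ` Votes `, which otherwise have to be typed exactly when selecting the column.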

In [494]:
movies

Unnamed: 0,original_title,release_year,genre,duration,country,content_rating,director,income,votes,score
0,The Shawshank Redemption,1995-02-10,Drama,142,USA,R,Frank Darabont,$ 28815245,2.278.845,9.3
1,The Godfather,09 21 1972,"Crime, Drama",175,USA,R,Francis Ford Coppola,$ 246120974,1.572.674,9.2
2,The Dark Knight,23 -07-2008,"Action, Crime, Drama",152,US,PG-13,Christopher Nolan,$ 1005455211,2.241.615,9.
3,The Godfather: Part II,1975-09-25,"Crime, Drama",220,USA,R,Francis Ford Coppola,"$ 4o8,035,783",1.098.714,9.0
4,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,$ 222831817,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...
96,The Sting,1974-03-21,"Comedy, Crime, Drama",129,USA,PG,George Roy Hill,$ 156000000,236.285,7.5
97,Das Boot,1982-03-18,"Adventure, Drama, Thriller",149,West Germany,R,Wolfgang Petersen,$ 11487676,226.427,7.5
98,Per qualche dollaro in piÃ¹,1965-12-20,Western,132,Italy,,Sergio Leone,$ 15000000,226.039,7.4
99,Jodaeiye Nader az Simin,2011-10-21,Drama,123,Iran,PG-13,Asghar Farhadi,$ 22926076,214.165,7.4


In [495]:
## Format the release year to show only the year of the film and change the datatype
## Map the irregular date strings to ISO format first. Two-digit years are assumed to mean
## 2004, 1999, 2003 and 1946 (day 46 is not a valid date, so '21-11-46' is read as DD-MM-YY).
movies['release_year'] = movies['release_year'].replace({
    '22 Feb 04': '2004-02-22',
    '10-29-99': '1999-10-29',
    '01/16-03': '2003-01-16',
    '21-11-46': '1946-11-21',
    'The 6th of marzo, year 1951': '1951-03-06',
})

## showing the year of film only
movies['release_year'] = movies['release_year'].str.extract(r'(\d{4})', expand=False)
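
When full dates matter and not only the year, the mixed formats can be parsed by trying a list of explicit formats in turn. This is a sketch only: the format list is an assumption based on the sample rows shown in the table, and any value that matches none of them is left as `NaT`.

```python
import pandas as pd

raw = pd.Series(["1995-02-10", "09 21 1972", "23 -07-2008", "not a date"])

# Candidate formats, tried in order (assumed from the visible sample rows).
formats = ["%Y-%m-%d", "%m %d %Y", "%d -%m-%Y"]

# Start with the first format, then fill the remaining gaps with later ones.
parsed = pd.to_datetime(raw, format=formats[0], errors="coerce")
for fmt in formats[1:]:
    parsed = parsed.fillna(pd.to_datetime(raw, format=fmt, errors="coerce"))

print(parsed)
```

Ambiguous strings such as two-digit years still need a manual decision, because no format string can tell which century was meant.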
             


In [496]:
movies 

Unnamed: 0,original_title,release_year,genre,duration,country,content_rating,director,income,votes,score
0,The Shawshank Redemption,1995,Drama,142,USA,R,Frank Darabont,$ 28815245,2.278.845,9.3
1,The Godfather,1972,"Crime, Drama",175,USA,R,Francis Ford Coppola,$ 246120974,1.572.674,9.2
2,The Dark Knight,2008,"Action, Crime, Drama",152,US,PG-13,Christopher Nolan,$ 1005455211,2.241.615,9.
3,The Godfather: Part II,1975,"Crime, Drama",220,USA,R,Francis Ford Coppola,"$ 4o8,035,783",1.098.714,9.0
4,Pulp Fiction,1994,"Crime, Drama",,USA,R,Quentin Tarantino,$ 222831817,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...
96,The Sting,1974,"Comedy, Crime, Drama",129,USA,PG,George Roy Hill,$ 156000000,236.285,7.5
97,Das Boot,1982,"Adventure, Drama, Thriller",149,West Germany,R,Wolfgang Petersen,$ 11487676,226.427,7.5
98,Per qualche dollaro in piÃ¹,1965,Western,132,Italy,,Sergio Leone,$ 15000000,226.039,7.4
99,Jodaeiye Nader az Simin,2011,Drama,123,Iran,PG-13,Asghar Farhadi,$ 22926076,214.165,7.4


In [497]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   original_title  100 non-null    object
 1   release_year    100 non-null    object
 2   genre           100 non-null    object
 3   duration        99 non-null     object
 4   country         100 non-null    object
 5   content_rating  77 non-null     object
 6   director        100 non-null    object
 7   income          100 non-null    object
 8    votes          100 non-null    object
 9   score           100 non-null    object
dtypes: object(10)
memory usage: 8.0+ KB


In [498]:
## changing data type from object to datetime
movies.release_year = pd.to_datetime(movies.release_year, errors='coerce')
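
Note that `pd.to_datetime` turns a bare year string into the first day of that year, so the column then displays values such as `1995-01-01` and no longer shows only the year. If a year-only column is the goal, one option is a nullable integer column. The sketch below assumes the four-digit year strings have already been extracted; the sample values are illustrative.

```python
import pandas as pd

years = pd.Series(["1995", "1972", None, "2008"])

# Convert to numbers, then to the nullable Int64 dtype so missing years stay missing.
year_int = pd.to_numeric(years, errors="coerce").astype("Int64")

print(year_int)
```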

In [499]:
## checking the info for the dataframe
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   original_title  100 non-null    object        
 1   release_year    100 non-null    datetime64[ns]
 2   genre           100 non-null    object        
 3   duration        99 non-null     object        
 4   country         100 non-null    object        
 5   content_rating  77 non-null     object        
 6   director        100 non-null    object        
 7   income          100 non-null    object        
 8    votes          100 non-null    object        
 9   score           100 non-null    object        
dtypes: datetime64[ns](1), object(9)
memory usage: 8.0+ KB


In [500]:
movies

Unnamed: 0,original_title,release_year,genre,duration,country,content_rating,director,income,votes,score
0,The Shawshank Redemption,1995-01-01,Drama,142,USA,R,Frank Darabont,$ 28815245,2.278.845,9.3
1,The Godfather,1972-01-01,"Crime, Drama",175,USA,R,Francis Ford Coppola,$ 246120974,1.572.674,9.2
2,The Dark Knight,2008-01-01,"Action, Crime, Drama",152,US,PG-13,Christopher Nolan,$ 1005455211,2.241.615,9.
3,The Godfather: Part II,1975-01-01,"Crime, Drama",220,USA,R,Francis Ford Coppola,"$ 4o8,035,783",1.098.714,9.0
4,Pulp Fiction,1994-01-01,"Crime, Drama",,USA,R,Quentin Tarantino,$ 222831817,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...
96,The Sting,1974-01-01,"Comedy, Crime, Drama",129,USA,PG,George Roy Hill,$ 156000000,236.285,7.5
97,Das Boot,1982-01-01,"Adventure, Drama, Thriller",149,West Germany,R,Wolfgang Petersen,$ 11487676,226.427,7.5
98,Per qualche dollaro in piÃ¹,1965-01-01,Western,132,Italy,,Sergio Leone,$ 15000000,226.039,7.4
99,Jodaeiye Nader az Simin,2011-01-01,Drama,123,Iran,PG-13,Asghar Farhadi,$ 22926076,214.165,7.4


In [503]:
# Standardize the country column to one spelling per country. First, check the count of countries
movies.country.value_counts()

USA             62
UK              12
Italy            4
Japan            4
France           3
South Korea      2
Germany          2
New Zesland      1
New Zealand      1
New Zeland       1
US.              1
Brazil           1
US               1
Italy1           1
India            1
Denmark          1
West Germany     1
Iran             1
Name: country, dtype: int64

In [504]:
# changing each country to a single spelling
movies['country'] = movies['country'].replace({'US':'USA', 'US.':'USA', 'New Zesland':'New Zealand', 'New Zeland':'New Zealand',
                                               'NaN':'None', 'Italy1':'Italy', 'West Germany': 'Germany'})
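
Part of this mapping can be made more general by stripping stray digits and punctuation before applying the alias table. A minimal sketch, assuming country names contain only letters and spaces; the `aliases` dictionary is illustrative and would need extending for other misspellings.

```python
import pandas as pd

country = pd.Series(["USA", "US", "US.", "New Zesland", "New Zeland", "Italy1", "West Germany"])

# Remove anything that is not a letter or a space (e.g. the "1" in "Italy1").
cleaned = country.str.strip().str.replace(r"[^A-Za-z ]", "", regex=True)

# Map the remaining variants to one spelling; full-value matches only.
aliases = {
    "US": "USA",
    "New Zesland": "New Zealand",
    "New Zeland": "New Zealand",
    "West Germany": "Germany",
}
country_std = cleaned.replace(aliases)

print(country_std.value_counts())
```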

In [505]:
movies.country.value_counts()

USA            64
UK             12
Italy           5
Japan           4
New Zealand     3
France          3
Germany         3
South Korea     2
Brazil          1
India           1
Denmark         1
Iran            1
Name: country, dtype: int64

In [507]:
# checking the info on income
movies.income

0         $ 28815245
1        $ 246120974
2       $ 1005455211
3      $ 4o8,035,783
4        $ 222831817
           ...      
96       $ 156000000
97        $ 11487676
98        $ 15000000
99        $ 22926076
100        $ 1864182
Name: income, Length: 101, dtype: object

In [508]:
## Remove the dollar sign, replace the letter 'o' with the digit 0, remove commas, drop null values and convert income to an integer
movies['income'] = movies['income'].str.replace('$', '', regex=False).str.strip()
movies['income'] = movies['income'].str.replace('o', '0', regex=False)
movies['income'] = movies['income'].str.replace(',', '', regex=False)
movies.dropna(subset=['income'], inplace=True)
movies['income'] = movies['income'].astype(int)
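
The same cleanup can be written so that it does not depend on listing each unwanted character. This is a sketch under the assumption that the only letter that stands in for a digit is a lowercase `o`; the sample values come from the table, plus one missing entry.

```python
import pandas as pd

income = pd.Series(["$ 28815245", "$ 4o8,035,783", None])

# Fix the letter/digit mix-up first, then keep digits only.
digits = income.str.replace("o", "0", regex=False).str.replace(r"[^0-9]", "", regex=True)

# Nullable Int64 keeps missing incomes as <NA> instead of forcing a row drop.
income_int = pd.to_numeric(digits, errors="coerce").astype("Int64")

print(income_int)
```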


In [509]:
## checking the info for the movies dataframe
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 100
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   original_title  100 non-null    object        
 1   release_year    100 non-null    datetime64[ns]
 2   genre           100 non-null    object        
 3   duration        99 non-null     object        
 4   country         100 non-null    object        
 5   content_rating  77 non-null     object        
 6   director        100 non-null    object        
 7   income          100 non-null    int64         
 8    votes          100 non-null    object        
 9   score           100 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 8.6+ KB


In [510]:
movies

Unnamed: 0,original_title,release_year,genre,duration,country,content_rating,director,income,votes,score
0,The Shawshank Redemption,1995-01-01,Drama,142,USA,R,Frank Darabont,28815245,2.278.845,9.3
1,The Godfather,1972-01-01,"Crime, Drama",175,USA,R,Francis Ford Coppola,246120974,1.572.674,9.2
2,The Dark Knight,2008-01-01,"Action, Crime, Drama",152,USA,PG-13,Christopher Nolan,1005455211,2.241.615,9.
3,The Godfather: Part II,1975-01-01,"Crime, Drama",220,USA,R,Francis Ford Coppola,408035783,1.098.714,9.0
4,Pulp Fiction,1994-01-01,"Crime, Drama",,USA,R,Quentin Tarantino,222831817,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...
96,The Sting,1974-01-01,"Comedy, Crime, Drama",129,USA,PG,George Roy Hill,156000000,236.285,7.5
97,Das Boot,1982-01-01,"Adventure, Drama, Thriller",149,Germany,R,Wolfgang Petersen,11487676,226.427,7.5
98,Per qualche dollaro in piÃ¹,1965-01-01,Western,132,Italy,,Sergio Leone,15000000,226.039,7.4
99,Jodaeiye Nader az Simin,2011-01-01,Drama,123,Iran,PG-13,Asghar Farhadi,22926076,214.165,7.4


In [511]:
## Deal with missing values, inconsistent values, NaN values and change duration datatype to an integer
## Unparseable durations are set to 0, which acts as a placeholder for "unknown" rather than a real length
movies['duration'] = movies['duration'].replace({'Nan': '0', 'Inf': '0', '178c': '0', '-': '0', 'Not Applicable': '0', ' ': '0'})
movies['duration'] = movies['duration'].fillna('0')
movies['duration'] = movies['duration'].astype(int)
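
Filling unknown durations with 0 keeps the column an integer, but it also pulls down averages computed later. A hedged alternative is to keep only values made entirely of digits and leave everything else missing. The sample values below are the ones handled in the cell above; treating `178c` as missing rather than as 178 is a choice, not a rule.

```python
import pandas as pd

duration = pd.Series(["142", "Nan", "Inf", "178c", "-", "Not Applicable", " ", None, "175"])

# True only where the whole string consists of digits.
is_whole_number = duration.str.fullmatch(r"\d+", na=False)

# Everything else becomes missing, and Int64 allows integers alongside <NA>.
duration_min = pd.to_numeric(duration.where(is_whole_number), errors="coerce").astype("Int64")

print(duration_min)
```

The digit check matters because `pd.to_numeric` alone would read `Inf` as infinity, which cannot be stored in an integer column.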


In [512]:
## checking the info for the movies dataframe
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 100
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   original_title  100 non-null    object        
 1   release_year    100 non-null    datetime64[ns]
 2   genre           100 non-null    object        
 3   duration        100 non-null    int64         
 4   country         100 non-null    object        
 5   content_rating  77 non-null     object        
 6   director        100 non-null    object        
 7   income          100 non-null    int64         
 8    votes          100 non-null    object        
 9   score           100 non-null    object        
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 8.6+ KB


In [513]:
## Have just one genre for each movie
movies['genre'] = movies['genre'].str.split(',').str[0]

In [514]:
movies

Unnamed: 0,original_title,release_year,genre,duration,country,content_rating,director,income,votes,score
0,The Shawshank Redemption,1995-01-01,Drama,142,USA,R,Frank Darabont,28815245,2.278.845,9.3
1,The Godfather,1972-01-01,Crime,175,USA,R,Francis Ford Coppola,246120974,1.572.674,9.2
2,The Dark Knight,2008-01-01,Action,152,USA,PG-13,Christopher Nolan,1005455211,2.241.615,9.
3,The Godfather: Part II,1975-01-01,Crime,220,USA,R,Francis Ford Coppola,408035783,1.098.714,9.0
4,Pulp Fiction,1994-01-01,Crime,0,USA,R,Quentin Tarantino,222831817,1.780.147,"8,9f"
...,...,...,...,...,...,...,...,...,...,...
96,The Sting,1974-01-01,Comedy,129,USA,PG,George Roy Hill,156000000,236.285,7.5
97,Das Boot,1982-01-01,Adventure,149,Germany,R,Wolfgang Petersen,11487676,226.427,7.5
98,Per qualche dollaro in piÃ¹,1965-01-01,Western,132,Italy,,Sergio Leone,15000000,226.039,7.4
99,Jodaeiye Nader az Simin,2011-01-01,Drama,123,Iran,PG-13,Asghar Farhadi,22926076,214.165,7.4


In [518]:
## Modify the score column to a standard format and change it to a float. First remove stray characters, then repair the malformed values.
movies['score'] = movies['score'].str.replace(',', '', regex=False)
movies['score'] = movies['score'].str.replace('f', '', regex=False)
movies['score'] = movies['score'].str.replace('e-0', '', regex=False)
movies['score'] = movies['score'].str.replace('89', '8.9', regex=False)
movies['score'] = movies['score'].str.replace('86', '8.6', regex=False)
movies['score'] = movies['score'].str.replace('87.0', '8.7', regex=False)
movies['score'] = movies['score'].str.replace('8..8', '8.8', regex=False)
movies['score'] = movies['score'].str.replace('8:8', '8.8', regex=False)
movies['score'] = movies['score'].str.replace('8.7.', '8.7', regex=False)
movies['score'] = movies['score'].str.replace('++8.7', '8.7', regex=False)

## changing datatype from object to float
movies['score'] = movies['score'].astype(float)
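
The value-by-value replacements can be condensed into one pattern. This is a sketch that assumes every score has a single leading digit and at most one decimal digit, so it would misread a score of 10; the sample strings mirror the malformed values handled above.

```python
import pandas as pd

scores = pd.Series(["9.3", "9.", "8,9f", "++8.7", "8..8", "8:8", "8.7.", "87.0", "8.7e-0"])

# Capture the first digit, skip any separators, then capture an optional second digit.
parts = scores.str.extract(r"(\d)\D*(\d)?")

# Rebuild "<digit>.<digit>", using 0 when no decimal digit was found.
score_float = (parts[0] + "." + parts[1].fillna("0")).astype(float)

print(score_float.tolist())
```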






In [519]:
movies

Unnamed: 0,original_title,release_year,genre,duration,country,content_rating,director,income,votes,score
0,The Shawshank Redemption,1995-01-01,Drama,142,USA,R,Frank Darabont,28815245,2.278.845,9.3
1,The Godfather,1972-01-01,Crime,175,USA,R,Francis Ford Coppola,246120974,1.572.674,9.2
2,The Dark Knight,2008-01-01,Action,152,USA,PG-13,Christopher Nolan,1005455211,2.241.615,9.0
3,The Godfather: Part II,1975-01-01,Crime,220,USA,R,Francis Ford Coppola,408035783,1.098.714,9.0
4,Pulp Fiction,1994-01-01,Crime,0,USA,R,Quentin Tarantino,222831817,1.780.147,8.9
...,...,...,...,...,...,...,...,...,...,...
96,The Sting,1974-01-01,Comedy,129,USA,PG,George Roy Hill,156000000,236.285,7.5
97,Das Boot,1982-01-01,Adventure,149,Germany,R,Wolfgang Petersen,11487676,226.427,7.5
98,Per qualche dollaro in piÃ¹,1965-01-01,Western,132,Italy,,Sergio Leone,15000000,226.039,7.4
99,Jodaeiye Nader az Simin,2011-01-01,Drama,123,Iran,PG-13,Asghar Farhadi,22926076,214.165,7.4


In [520]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 100
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   original_title  100 non-null    object        
 1   release_year    100 non-null    datetime64[ns]
 2   genre           100 non-null    object        
 3   duration        100 non-null    int64         
 4   country         100 non-null    object        
 5   content_rating  77 non-null     object        
 6   director        100 non-null    object        
 7   income          100 non-null    int64         
 8    votes          100 non-null    object        
 9   score           100 non-null    float64       
dtypes: datetime64[ns](1), float64(1), int64(2), object(6)
memory usage: 8.6+ KB


In [521]:
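## Remove the thousands separators from the votes column and convert it to an integer
## Note: the column name ' votes ' still carries the surrounding spaces from the raw file
movies[' votes '] = movies[' votes '].str.replace(',', '', regex=False).str.replace('.', '', regex=False).astype(int)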
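
Because the column name includes spaces, it has to be typed exactly as ` votes ` to be selected. A small sketch of trimming the column names first and then removing every non-digit character; the two vote counts are taken from the table.

```python
import pandas as pd

movies_demo = pd.DataFrame({" votes ": ["2.278.845", "236.285"]})

# Trim whitespace from all column names so they are easier to reference.
movies_demo.columns = movies_demo.columns.str.strip()

# Drop the dots (and any other non-digit) used as thousands separators.
movies_demo["votes"] = movies_demo["votes"].str.replace(r"\D", "", regex=True).astype(int)

print(movies_demo)
```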


In [522]:
movies

Unnamed: 0,original_title,release_year,genre,duration,country,content_rating,director,income,votes,score
0,The Shawshank Redemption,1995-01-01,Drama,142,USA,R,Frank Darabont,28815245,2278845,9.3
1,The Godfather,1972-01-01,Crime,175,USA,R,Francis Ford Coppola,246120974,1572674,9.2
2,The Dark Knight,2008-01-01,Action,152,USA,PG-13,Christopher Nolan,1005455211,2241615,9.0
3,The Godfather: Part II,1975-01-01,Crime,220,USA,R,Francis Ford Coppola,408035783,1098714,9.0
4,Pulp Fiction,1994-01-01,Crime,0,USA,R,Quentin Tarantino,222831817,1780147,8.9
...,...,...,...,...,...,...,...,...,...,...
96,The Sting,1974-01-01,Comedy,129,USA,PG,George Roy Hill,156000000,236285,7.5
97,Das Boot,1982-01-01,Adventure,149,Germany,R,Wolfgang Petersen,11487676,226427,7.5
98,Per qualche dollaro in piÃ¹,1965-01-01,Western,132,Italy,,Sergio Leone,15000000,226039,7.4
99,Jodaeiye Nader az Simin,2011-01-01,Drama,123,Iran,PG-13,Asghar Farhadi,22926076,214165,7.4


In [524]:
## checking content_rating column
movies.content_rating.value_counts()

R            45
PG-13        12
PG           11
G             6
Not Rated     1
Approved      1
Unrated       1
Name: content_rating, dtype: int64

In [525]:
## Replace null values in content_rating with 'Unrated'
movies['content_rating'] = movies['content_rating'].fillna('Unrated')
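
The value counts list both `Not Rated` and `Unrated`. If the two labels are meant to describe the same thing, which is an assumption and not something the dataset states, they can be merged in the same step. The sample ratings are illustrative.

```python
import pandas as pd

ratings = pd.Series(["R", None, "Not Rated", "Unrated", "PG"])

# Fill missing ratings, then fold the "Not Rated" label into "Unrated".
ratings_clean = ratings.fillna("Unrated").replace({"Not Rated": "Unrated"})

print(ratings_clean.value_counts())
```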

In [526]:
movies.content_rating.value_counts()

R            45
Unrated      24
PG-13        12
PG           11
G             6
Not Rated     1
Approved      1
Name: content_rating, dtype: int64

In [528]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 100
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   original_title  100 non-null    object        
 1   release_year    100 non-null    datetime64[ns]
 2   genre           100 non-null    object        
 3   duration        100 non-null    int64         
 4   country         100 non-null    object        
 5   content_rating  100 non-null    object        
 6   director        100 non-null    object        
 7   income          100 non-null    int64         
 8    votes          100 non-null    int64         
 9   score           100 non-null    float64       
dtypes: datetime64[ns](1), float64(1), int64(3), object(5)
memory usage: 8.6+ KB


In [530]:
## Checking the Final Table
movies

Unnamed: 0,original_title,release_year,genre,duration,country,content_rating,director,income,votes,score
0,The Shawshank Redemption,1995-01-01,Drama,142,USA,R,Frank Darabont,28815245,2278845,9.3
1,The Godfather,1972-01-01,Crime,175,USA,R,Francis Ford Coppola,246120974,1572674,9.2
2,The Dark Knight,2008-01-01,Action,152,USA,PG-13,Christopher Nolan,1005455211,2241615,9.0
3,The Godfather: Part II,1975-01-01,Crime,220,USA,R,Francis Ford Coppola,408035783,1098714,9.0
4,Pulp Fiction,1994-01-01,Crime,0,USA,R,Quentin Tarantino,222831817,1780147,8.9
...,...,...,...,...,...,...,...,...,...,...
96,The Sting,1974-01-01,Comedy,129,USA,PG,George Roy Hill,156000000,236285,7.5
97,Das Boot,1982-01-01,Adventure,149,Germany,R,Wolfgang Petersen,11487676,226427,7.5
98,Per qualche dollaro in piÃ¹,1965-01-01,Western,132,Italy,Unrated,Sergio Leone,15000000,226039,7.4
99,Jodaeiye Nader az Simin,2011-01-01,Drama,123,Iran,PG-13,Asghar Farhadi,22926076,214165,7.4


In [532]:
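## Save cleaned data without the row index, so no extra 'Unnamed: 0' column appears when the file is read back
movies.to_csv('Data_Wrangling_Movies_Cleaned.csv', index=False)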
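
Columns named `Unnamed: 0` typically appear when a CSV that was saved with its row index is read back in. The round trip below, using an in-memory buffer and a made-up one-row frame, sketches how `index=False` avoids that.

```python
import io

import pandas as pd

df = pd.DataFrame({"original_title": ["The Sting"], "score": [7.5]})

# Write without the index, then read the same text back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
```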