# Project 4: Merging & Cleaning & Transforming Data (Movies Dataset)

# Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 4 on your own. If you need any help or inspiration, have a look at the videos or the Jupyter Notebook with the full code. <br> <br>
Keep in mind that it's all about __getting the right results/conclusions__, not about reproducing identical code. Things can be coded in many different ways, so even if you reach the same conclusions, it's very unlikely that your code will match ours exactly.

## Introduction / Getting the Datasets

1. __Load__ and __inspect__ the datasets "movies_clean.csv" and "credits.csv". __Identify__ stringified/nested __json columns__ in the __credits__ dataset.

In [2]:
import pandas as pd
import numpy as np
import ast


In [3]:
movies_df = pd.read_csv("movies_clean.csv", low_memory=False)
credit_df = pd.read_csv("credits.csv", low_memory=False)

### Inspecting dataframes 

In [4]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44691 entries, 0 to 44690
Data columns (total 18 columns):
id                       44691 non-null int64
title                    44691 non-null object
tagline                  20284 non-null object
release_date             44657 non-null object
genres                   42586 non-null object
belongs_to_collection    4463 non-null object
original_language        44681 non-null object
budget_musd              8854 non-null float64
revenue_musd             7385 non-null float64
production_companies     33356 non-null object
production_countries     38835 non-null object
vote_count               44691 non-null float64
vote_average             42077 non-null float64
popularity               44691 non-null float64
runtime                  43179 non-null float64
overview                 43740 non-null object
spoken_languages         41094 non-null object
poster_path              44467 non-null object
dtypes: float64(6), int64(1), object(11)
me

In [5]:
credit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
cast    45476 non-null object
crew    45476 non-null object
id      45476 non-null int64
dtypes: int64(1), object(2)
memory usage: 1.0+ MB


In [6]:
#credit_df.crew[0]
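
To make the "identify stringified columns" step concrete, here is a minimal sketch that flags columns whose values are strings that look like serialized lists or dicts. The two-row `toy_credits` frame and the helper `looks_stringified` are hypothetical stand-ins, not part of the original notebook, and the check is only a heuristic.

```python
import pandas as pd

# Hypothetical two-row stand-in for credits.csv (not the real data).
toy_credits = pd.DataFrame({
    "cast": ["[{'cast_id': 14, 'name': 'Tom Hanks'}]", "[]"],
    "crew": ["[{'job': 'Director', 'name': 'John Lasseter'}]", "[]"],
    "id": [862, 8844],
})

def looks_stringified(series):
    """Heuristic: every non-null value is a str starting with '[' or '{'."""
    values = series.dropna()
    if values.empty:
        return False
    return all(
        isinstance(v, str) and v.lstrip().startswith(("[", "{"))
        for v in values
    )

# Collect the names of all columns that pass the heuristic.
stringified_cols = [
    col for col in toy_credits.columns if looks_stringified(toy_credits[col])
]
print(stringified_cols)
```

On the real data, the same check could be applied to `credit_df`; looking at a single element, such as `credit_df.crew[0]`, is a quick manual alternative.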

## Preparing the Data for Merge

2. __Drop Duplicates__ in the credits dataset. (similar to Project 3)

In [7]:
credit_df.drop_duplicates(subset = 'id', inplace = True)

### Checking for remaining duplicates

In [8]:
#credit_df['id'].value_counts()
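
As an alternative to scanning `value_counts()` by eye, the duplicate check can be expressed directly. The sketch below uses a small hypothetical frame (`toy`) rather than the real credits data and relies only on `Series.duplicated`, `DataFrame.drop_duplicates` and `Series.is_unique`.

```python
import pandas as pd

# Hypothetical stand-in for credits.csv with one repeated id.
toy = pd.DataFrame({"id": [1, 2, 2, 3], "cast": ["a", "b", "b", "c"]})

# Count rows whose id has already appeared earlier in the frame.
n_dupes_before = int(toy["id"].duplicated().sum())

# Keep the first row for each id, as in the notebook cell above.
deduped = toy.drop_duplicates(subset="id")

# After dropping, every id should occur exactly once.
ids_unique_after = deduped["id"].is_unique
print(n_dupes_before, ids_unique_after)
```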

## Merging the Data

3. __Merge/Join__ the datasets movies_clean and credits. -> Add the features __cast__ and __crew__ to the movies_clean dataset.

<h3>Performing a join operation</h3>

In [9]:
df = movies_df.merge(credit_df, how = 'left', 
                     left_on = 'id', right_on = 'id')
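
A left join can be sanity-checked while it runs. The sketch below is one possible way to do that, on hypothetical toy frames: `validate="many_to_one"` makes pandas raise an error if the right-hand keys are not unique, and `indicator=True` adds a `_merge` column showing which rows found no match. The third movie is invented purely for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for movies_clean.csv and credits.csv.
movies = pd.DataFrame({
    "id": [862, 8844, 999],
    "title": ["Toy Story", "Jumanji", "Hypothetical Movie"],
})
credits = pd.DataFrame({
    "id": [862, 8844],
    "cast": ["[]", "[]"],
    "crew": ["[]", "[]"],
})

# Left join: keep every movie, attach cast/crew where an id matches.
merged = movies.merge(
    credits,
    how="left",
    on="id",
    validate="many_to_one",  # fails loudly if credits has duplicate ids
    indicator=True,          # adds a '_merge' column
)

# Movies without a credits row are marked 'left_only'.
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

Because the right-hand keys are unique, the merged frame keeps exactly as many rows as the left frame, which matches the 44691 entries reported before and after the merge in this notebook.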

<h3>Inspecting the joined DataFrame</h3>

In [10]:
df.head(2)

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,production_countries,vote_count,vote_average,popularity,runtime,overview,spoken_languages,poster_path,cast,crew
0,862,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,United States of America,5415.0,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",English,<img src='http://image.tmdb.org/t/p/w185//rhIR...,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
1,8844,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,65.0,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,2413.0,6.9,17.015539,104.0,When siblings Judy and Peter discover an encha...,English|Français,<img src='http://image.tmdb.org/t/p/w185//vzmL...,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de..."


## Cleaning and Transforming the new "Cast" Column

4.  __Evaluate__ Python Expressions in the stringified column "cast" and __remove quotes__ ("") where possible.

<h3>Overwriting the cast column containing the JSON strings</h3>

In [11]:
df.cast = df.cast.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)
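
The cell above uses `ast.literal_eval` rather than `json.loads` because the strings are Python literals with single quotes, which are not valid JSON. The following sketch illustrates the difference on a shortened, hand-written example string (`raw`), not on an actual row of the dataset.

```python
import ast
import json

# Shortened, hand-written example in the style of the "cast" column.
raw = "[{'cast_id': 14, 'character': 'Woody (voice)', 'name': 'Tom Hanks'}]"

# literal_eval safely parses Python literals (lists, dicts, strings, numbers).
parsed = ast.literal_eval(raw)

# json.loads rejects the same string because JSON requires double quotes.
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

print(type(parsed).__name__, parsed[0]["name"], json_ok)
```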

5. __Determine__ the __cast size__ for all movies (number of actors) and add the additional column "cast_size".

<h3>Adding the additional column cast_size</h3>

In [12]:
df['cast_size']  = df.cast.apply(lambda x: len(x))

In [13]:
df.cast_size.value_counts(dropna = False).head(5)

10    2770
8     2729
7     2710
6     2649
5     2637
Name: cast_size, dtype: int64

6. __Extract__ all __actor names__ from the column "cast" and __overwrite__ "cast". If a movie has more than one actor, __separate names by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wallace Shawn|John Ratzenberger|Annie Potts|John Morris|Erik von Detten|Laurie Metcalf|R. Lee Ermey|Sarah Freeman|Penn Jillette'.

In [14]:
df.cast = df.cast.apply(lambda x: '|'.join(i['name'] for i in x) if isinstance(x, list) else np.nan)

In [15]:
df.cast[1]

'Robin Williams|Jonathan Hyde|Kirsten Dunst|Bradley Pierce|Bonnie Hunt|Bebe Neuwirth|David Alan Grier|Patricia Clarkson|Adam Hann-Byrd|Laura Bell Bundy|James Handy|Gillian Barber|Brandon Obray|Cyrus Thiedeke|Gary Joseph Thorup|Leonard Zola|Lloyd Berry|Malcolm Stewart|Annabel Kershaw|Darryl Henriques|Robyn Driscoll|Peter Bryant|Sarah Gilson|Florica Vlad|June Lion|Brenda Lockmuller'
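
The size and name extraction steps can be tried out on a tiny frame first. The sketch below assumes a hypothetical `toy` frame with one populated cast list, one empty list and one missing value. Unlike the `len(x)` lambda used in the notebook, it guards the size computation with `isinstance`, which is an optional safety measure for rows where the merge found no credits.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a populated cast, an empty cast, and a missing value.
toy = pd.DataFrame({"cast": [
    [{"name": "Tom Hanks"}, {"name": "Tim Allen"}],
    [],
    np.nan,
]})

# Number of cast members; NaN where there is no list at all.
toy["cast_size"] = toy["cast"].apply(
    lambda x: len(x) if isinstance(x, list) else np.nan
)

# Pipe-separated actor names; NaN where there is no list at all.
toy["cast"] = toy["cast"].apply(
    lambda x: "|".join(i["name"] for i in x) if isinstance(x, list) else np.nan
)

print(toy)
```

Note that joining an empty list yields an empty string rather than a missing value, so movies with an empty cast list end up with `''` in the "cast" column.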

7. __Inspect__ cast with value_counts(). Do you see anything strange? __Take reasonable measures__!

In [16]:
df.cast.value_counts().head(10)

                      2189
Georges Méliès          24
Louis Theroux           15
Mel Blanc               12
Jimmy Carr               9
George Carlin            8
Werner Herzog            8
David Attenborough       8
Louis C.K.               8
Doug Stanhope            6
Name: cast, dtype: int64

<h1 style="color: green">The cast column contains empty strings</h1>

In [17]:
df.cast = df.cast.replace('', np.nan)

In [18]:
df.cast.value_counts()

Georges Méliès    24
Louis Theroux

## Cleaning and Transforming the new "Crew" Column

8.  __Evaluate__ Python Expressions in the stringified column "crew" and __remove quotes__ ("") where possible.

In [19]:
df.crew = df.crew.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [20]:
df.crew[0]

[{'credit_id': '52fe4284c3a36847f8024f49',
  'department': 'Directing',
  'gender': 2,
  'id': 7879,
  'job': 'Director',
  'name': 'John Lasseter',
  'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f4f',
  'department': 'Writing',
  'gender': 2,
  'id': 12891,
  'job': 'Screenplay',
  'name': 'Joss Whedon',
  'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f55',
  'department': 'Writing',
  'gender': 2,
  'id': 7,
  'job': 'Screenplay',
  'name': 'Andrew Stanton',
  'profile_path': '/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f5b',
  'department': 'Writing',
  'gender': 2,
  'id': 12892,
  'job': 'Screenplay',
  'name': 'Joel Cohen',
  'profile_path': '/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f61',
  'department': 'Writing',
  'gender': 0,
  'id': 12893,
  'job': 'Screenplay',
  'name': 'Alec Sokolow',
  'profile_path': '/v79vlRYi94BZUQnkkyzn

9. __Determine__ the __crew size__ for all movies (size of the crew) and add the additional column "crew_size".

In [21]:
df['crew_size'] = df.crew.apply(lambda x: len(x))

In [23]:
df.crew_size.value_counts(dropna = False).head(10)

2     6197
3     4951
1     4866
4     3064
5     2240
7     1968
10    1928
6     1867
8     1864
9     1686
Name: crew_size, dtype: int64

10. __Extract__ the __director name__ from the column "crew" and create the new column "director". <br> For example: The value in the first row (Toy Story) should be 'John Lasseter'.

In [24]:
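def director(x):
    # Return the name of the first crew member whose job is 'Director'.
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    # No director listed for this movie.
    return np.nan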

In [25]:
df['director'] = df.crew.apply(director)

In [26]:
df.director.value_counts().head(5)

John Ford           66
Michael Curtiz      65
Werner Herzog       54
Alfred Hitchcock    53
Georges Méliès      49
Name: director, dtype: int64
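
The same lookup can also be written with `next()` and a generator expression. This is a sketch of an alternative, not the notebook's own implementation: the name `first_director` and the `isinstance` guard for missing crew values are assumptions added here, and `toy_crew` is a hand-written example.

```python
import numpy as np

def first_director(crew):
    """Return the first crew member with job 'Director', or NaN if none."""
    # Guard against missing values (e.g. NaN) instead of a list of dicts.
    if not isinstance(crew, list):
        return np.nan
    # next() stops at the first match and falls back to NaN otherwise.
    return next(
        (member["name"] for member in crew if member.get("job") == "Director"),
        np.nan,
    )

# Hand-written example in the style of the "crew" column.
toy_crew = [
    {"job": "Screenplay", "name": "Joss Whedon"},
    {"job": "Director", "name": "John Lasseter"},
]

print(first_director(toy_crew))
```

Like the loop version, this returns only the first director found, so co-directed movies are attributed to a single name.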

## Final Steps

11. __Drop__ the column "crew" and __save__ the dataset in a CSV file.

In [27]:
df.drop(columns = ['crew'], inplace =True)

<h3>Checking that the crew column has been dropped</h3>

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44691 entries, 0 to 44690
Data columns (total 22 columns):
id                       44691 non-null int64
title                    44691 non-null object
tagline                  20284 non-null object
release_date             44657 non-null object
genres                   42586 non-null object
belongs_to_collection    4463 non-null object
original_language        44681 non-null object
budget_musd              8854 non-null float64
revenue_musd             7385 non-null float64
production_companies     33356 non-null object
production_countries     38835 non-null object
vote_count               44691 non-null float64
vote_average             42077 non-null float64
popularity               44691 non-null float64
runtime                  43179 non-null float64
overview                 43740 non-null object
spoken_languages         41094 non-null object
poster_path              44467 non-null object
cast                     42502 non-null obj
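
Task 11 also asks for the dataset to be saved, which the cells above do not show. A minimal sketch of that step follows, using a one-row hypothetical frame and a temporary directory. The filename `movies_complete.csv` is an assumption, not a name given in the project brief.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical one-row stand-in for the final merged DataFrame.
toy = pd.DataFrame({
    "id": [862],
    "title": ["Toy Story"],
    "cast": ["Tom Hanks|Tim Allen"],
    "director": ["John Lasseter"],
})

# Hypothetical output location; adjust the path and filename as needed.
out_path = Path(tempfile.mkdtemp()) / "movies_complete.csv"

# index=False avoids writing the row index as an extra unnamed column.
toy.to_csv(out_path, index=False)

# Read the file back to confirm that the columns survived the round trip.
reloaded = pd.read_csv(out_path)
print(list(reloaded.columns))
```

Passing `index=False` matters if the file is loaded again later, since a saved index otherwise reappears as an `Unnamed: 0` column.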

# +++++++++ See some Hints below +++++++++++++

# ++++++++++++++++ Hints++++++++++++++++++++

__Hints for 2.__<br>
There cannot be two or more movies with the same movie id.

__Hints for 3.__<br>
You can use a left join with movies_clean as left dataset and credits as right dataset.

__Hints for 4.__<br>
This is very similar to Question 3 in Project 3.

__Hints for 5.__<br> 
Apply an appropriate lambda function to all column elements.

__Hints for 6.__<br>
This is very similar to Questions 4-8 in Project 3.

__Hints for 7.__<br>
This is very similar to Question 9 in Project 3.

__Hints for 10.__<br> 
Apply an appropriate user-defined function (a bit more complex) to all column elements.