# Lab Test on manipulating data with Pandas

## Rules:

•	Make sure you are comfortable before the lab test starts – students leaving the lab during the test will not be allowed back in.

•	You must sit your assigned lab test – you will not be able to access any other lab test.

•	You must use the lab machines. You are not allowed to use any other devices, including your own laptop or phone. You should have no other device on the desktop during the test.

•	You may use paper. 

•	You may access Brightspace notes during the lab test, from the lab machine.

•	You must not communicate, verbally or electronically, with anyone other than the lab supervisor during the test.

•	Turn up at the start of the lab, log in to one of the desktops, and start up Jupyter Notebook.

•	When you have completed your test, check that you have answered all parts and submit.  Log off and leave the lab. 

•	Do not speak to anyone, either in the lab or in the corridor outside, until the lab test is over.


In [3]:
import pandas as pd
import os
from os.path import join
from os import getcwd

In [4]:
datadir = join('Z:','Disney')

## 1.  Read data into dataframes from the directory datadir

### 1a Read disney-characters.csv into the dataframe chardf.

In [5]:
# Build the path from datadir, as the task asks
chardf = pd.read_csv(join(datadir, 'disney-characters.csv'))

### 1b Read disney-director.csv into the dataframe dirdf

In [6]:
# Build the path from datadir, as the task asks
dirdf = pd.read_csv(join(datadir, 'disney-director.csv'))

### 1c Read disney_movies_total_gross.csv into the dataframe mgrdf

In [7]:
# Build the path from datadir, as the task asks
mgrdf = pd.read_csv(join(datadir, 'disney_movies_total_gross.csv'))

## 2.	Check and tidy the data

### 2a. Find the number of unique values for each of the columns in each of the dataframes.

In [8]:
print(chardf.nunique(), '\n')
print(dirdf.nunique(), '\n')
print(mgrdf.nunique(), '\n')

index           56
movie_title     56
release_date    56
hero            49
villian         46
song            47
dtype: int64 

index       56
name        56
director    29
dtype: int64 

index                       579
movie_title                 573
release_date                553
genre                        12
MPAA_rating                   5
total_gross                 576
inflation_adjusted_gross    576
dtype: int64 



### 2b. Remove redundant '\n' from movie_title in chardf

In [9]:
# Remove newline characters using the vectorised string accessor
chardf['movie_title'] = chardf['movie_title'].str.replace("\n", "", regex=False)
chardf['movie_title'].head()

0    \rSnow White and the Seven Dwarfs
1                          \rPinocchio
2                           \rFantasia
3                                Dumbo
4                              \rBambi
Name: movie_title, dtype: object
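
The output above still shows a leading `\r` on several titles, which suggests the raw values contained Windows-style `\r\n` sequences (an assumption, since the raw file is not shown here). A minimal sketch on made-up titles of how removing only `\n` differs from removing both characters:

```python
import pandas as pd

# Hypothetical titles, not read from disney-characters.csv
titles = pd.Series(["\r\nPinocchio", "Dumbo", "\r\nBambi"])

# Removing only "\n" leaves the carriage return behind
only_newline = titles.str.replace("\n", "", regex=False)

# Removing both "\r" and "\n" yields clean titles
both = titles.str.replace(r"[\r\n]", "", regex=True)

print(only_newline.tolist())
print(both.tolist())
```

The task only asks for `\n` to be removed, so the leftover `\r` is consistent with the instructions; it matters later when titles are compared as exact strings.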

### 2c.	In both chardf and mgrdf, save the 4-digit year from release_date as a new integer column 'year', then drop the release_date column.

In [10]:
chardf['year'] = chardf['release_date'].str[-4:].astype(int)
chardf.drop(columns='release_date', inplace=True)

mgrdf['year'] = mgrdf['release_date'].str[-4:].astype(int)
mgrdf.drop(columns='release_date', inplace=True)
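
The slicing approach above relies on every release_date string ending in a 4-digit year. A minimal sketch on made-up date strings (the `"%B %d, %Y"` format is an assumption about how the dates are written, not taken from the CSV files) comparing it with parsing the dates first:

```python
import pandas as pd

# Hypothetical date strings; the real release_date format may differ
dates = pd.Series(["December 21, 1937", "February 7, 1940"])

# Approach used above: take the last four characters and cast to int
year_by_slice = dates.str[-4:].astype(int)

# Alternative: parse to datetime, then read the year component
year_by_parse = pd.to_datetime(dates, format="%B %d, %Y").dt.year

print(year_by_slice.tolist())
print(year_by_parse.tolist())
```

Parsing is stricter: it raises on malformed dates, whereas slicing would silently produce a wrong or non-numeric value.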

### 2d. Turn mgrdf.total_gross into an integer  (first remove commas and $)

In [11]:
# Strip "$" and "," with the vectorised string accessor, then convert to int
mgrdf['total_gross'] = mgrdf['total_gross'].str.replace("$", "", regex=False).str.replace(",", "", regex=False)
mgrdf['total_gross'] = mgrdf['total_gross'].astype(int)
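
As a compact variant, both symbols can be removed in one pass with a character class. This is a sketch on made-up values in the same `$1,234` style, not the real total_gross column:

```python
import pandas as pd

# Hypothetical values formatted like total_gross
gross = pd.Series(["$184,925,485", "$84,300,000", "$0"])

# Inside [...] the "$" is a literal character, so this strips "$" and ","
gross_int = gross.str.replace(r"[$,]", "", regex=True).astype("int64")

print(gross_int.tolist())
```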

### 2e. Ensure that movie_title and name in all tables are of type string.

In [12]:
chardf.movie_title = chardf.movie_title.astype(str)
dirdf.name = dirdf.name.astype(str)
mgrdf.movie_title = mgrdf.movie_title.astype(str)

## 3. Merge  the data

### 3a  Merge chardf and dirdf on the index, keeping only indexes that match, saving the result in idxdf.

In [13]:
idxdf = pd.merge(chardf, dirdf, on="index")
idxdf.head()

Unnamed: 0,index,movie_title,hero,villian,song,year,name,director
0,0,\rSnow White and the Seven Dwarfs,Snow White,Evil Queen,Some Day My Prince Will Come,1937,Snow White and the Seven Dwarfs,David Hand
1,1,\rPinocchio,Pinocchio,Stromboli,When You Wish upon a Star,1940,Pinocchio,Ben Sharpsteen
2,2,\rFantasia,,Chernabog,,1940,Fantasia,full credits
3,3,Dumbo,Dumbo,Ringmaster,Baby Mine,1941,Dumbo,Ben Sharpsteen
4,4,\rBambi,Bambi,Hunter,Love Is a Song,1942,Bambi,David Hand


### 3b Merge chardf and dirdf, using movie_title from chardf and name from dirdf , keeping only rows that match, saving the result in chddf.

In [14]:
chddf = pd.merge(chardf, dirdf, left_on="movie_title", right_on="name")

### 3c.	Merge mgrdf with idxdf using the movie_title and year from mgrdf and the name and year from idxdf, keeping financial information for all movies, saving the result in chgrdf.

In [15]:
mgrdf.head()

Unnamed: 0,index,movie_title,genre,MPAA_rating,total_gross,inflation_adjusted_gross,year
0,0,Snow White and the Seven Dwarfs,Musical,G,184925485,"$5,228,953,251",1937
1,1,Pinocchio,Adventure,G,84300000,"$2,188,229,052",1940
2,2,Fantasia,Musical,G,83320000,"$2,187,090,808",1940
3,3,Song of the South,Adventure,G,65000000,"$1,078,510,579",1946
4,4,Cinderella,Drama,G,85000000,"$920,608,730",1950


In [16]:
idxdf.head()

Unnamed: 0,index,movie_title,hero,villian,song,year,name,director
0,0,\rSnow White and the Seven Dwarfs,Snow White,Evil Queen,Some Day My Prince Will Come,1937,Snow White and the Seven Dwarfs,David Hand
1,1,\rPinocchio,Pinocchio,Stromboli,When You Wish upon a Star,1940,Pinocchio,Ben Sharpsteen
2,2,\rFantasia,,Chernabog,,1940,Fantasia,full credits
3,3,Dumbo,Dumbo,Ringmaster,Baby Mine,1941,Dumbo,Ben Sharpsteen
4,4,\rBambi,Bambi,Hunter,Love Is a Song,1942,Bambi,David Hand


In [17]:
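# No `how` argument is given, so this is an inner join: only rows matching on title and year are kept
chgrdf = pd.merge(mgrdf, idxdf, left_on=['movie_title', 'year'], right_on=['name', 'year'])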
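
The task asks to keep financial information for all movies, while the merge above uses the default inner join, which drops rows of mgrdf that have no match (Song of the South, for instance, appears in mgrdf.head() but has no counterpart among the rows shown for idxdf). A minimal sketch on made-up miniature frames (`gross` and `credits` are hypothetical stand-ins for mgrdf and idxdf) of how `how="left"` changes the result:

```python
import pandas as pd

# Hypothetical miniature versions of mgrdf and idxdf
gross = pd.DataFrame({
    "movie_title": ["Pinocchio", "Song of the South"],
    "year": [1940, 1946],
    "total_gross": [84300000, 65000000],
})
credits = pd.DataFrame({
    "name": ["Pinocchio"],
    "year": [1940],
    "director": ["Ben Sharpsteen"],
})

keys = dict(left_on=["movie_title", "year"], right_on=["name", "year"])

# Default how="inner": only rows present in both frames survive
inner = pd.merge(gross, credits, **keys)

# how="left": every row of the left frame is kept, unmatched columns become NaN
left = pd.merge(gross, credits, how="left", **keys)

print(len(inner), len(left))
```

With the left join, the unmatched movie keeps its total_gross and gets missing values for director and name.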

## 4.	Queries: 

### 4a.	Using chgrdf, show each genre and the number of movies from that genre.

In [18]:
chgrdf['genre'].value_counts()


genre
Adventure    30
Musical       5
Comedy        5
Drama         3
Name: count, dtype: int64

### 4b.	Using chgrdf, for each row where movie_title_x is not equal to movie_title_y, display movie_title_x, movie_title_y, year, genre, director and total_gross.

In [19]:
filter_df = chgrdf.query('movie_title_x != movie_title_y')
filter_df[['movie_title_x', 'movie_title_y', 'year', 'genre', 'director', 'total_gross']]

Unnamed: 0,movie_title_x,movie_title_y,year,genre,director,total_gross
0,Snow White and the Seven Dwarfs,\rSnow White and the Seven Dwarfs,1937,Musical,David Hand,184925485
1,Pinocchio,\rPinocchio,1940,Adventure,Ben Sharpsteen,84300000
2,Fantasia,\rFantasia,1940,Musical,full credits,83320000
4,Lady and the Tramp,\rLady and the Tramp,1955,Drama,Hamilton Luske,93600000
6,101 Dalmatians,\rOne Hundred and One Dalmatians,1961,Comedy,Wolfgang Reitherman,153000000
7,The Sword in the Stone,\rThe Sword in the Stone,1963,Adventure,Wolfgang Reitherman,22182353
10,The Many Adventures of Winnie the Pooh,\rThe Many Adventures of Winnie the Pooh,1977,,Wolfgang Reitherman,0
11,The Rescuers,\rThe Rescuers,1977,Adventure,Wolfgang Reitherman,48775599
12,The Fox and the Hound,\rThe Fox and the Hound,1981,Comedy,Art Stevens,43899231
13,The Black Cauldron,\rThe Black Cauldron,1985,Adventure,Ted Berman,21288692


### 4c.	Using chgrdf, for each row where movie_title_x or movie_title_y contains the string 'Dalmatian', display movie_title_x, movie_title_y, year, genre, director and total_gross.

In [20]:
filter_df = chgrdf.query('movie_title_x.str.contains("Dalmatian") or movie_title_y.str.contains("Dalmatian")')
filter_df[['movie_title_x', 'movie_title_y', 'year', 'genre', 'director', 'total_gross']]

Unnamed: 0,movie_title_x,movie_title_y,year,genre,director,total_gross
6,101 Dalmatians,\rOne Hundred and One Dalmatians,1961,Comedy,Wolfgang Reitherman,153000000
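
The same filter can also be written as a boolean mask instead of a query string. A minimal sketch on made-up rows that mimic the two title columns (the frame `df` is hypothetical, not chgrdf itself):

```python
import pandas as pd

# Hypothetical rows mimicking the two title columns in chgrdf
df = pd.DataFrame({
    "movie_title_x": ["101 Dalmatians", "Dumbo"],
    "movie_title_y": ["\rOne Hundred and One Dalmatians", "Dumbo"],
})

# True where either column contains the literal substring
mask = (
    df["movie_title_x"].str.contains("Dalmatian", regex=False)
    | df["movie_title_y"].str.contains("Dalmatian", regex=False)
)
matches = df[mask]

print(matches)
```

If either column could hold missing values, passing `na=False` to `str.contains` treats them as non-matches rather than propagating NaN into the mask.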


## 5.	What did you learn? Use Markdown or comments to enter your answer.
### 5a.	Explain the difference between idxdf and chddf.

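idxdf merges chardf and dirdf on the shared 'index' column, so rows are paired by their index number regardless of how the titles are spelled. chddf instead merges on the title text, matching movie_title from chardf against name from dirdf.

Both are inner joins over the same two datasets, but chddf keeps only rows whose titles are exactly equal. Titles that still carry a leading '\r' after step 2b, or that are spelled differently (for example 'One Hundred and One Dalmatians' versus '101 Dalmatians'), are dropped from chddf but kept in idxdf.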
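
A minimal sketch of the two merges on made-up rows (the frames `chars` and `dirs` are hypothetical stand-ins for chardf and dirdf):

```python
import pandas as pd

# Hypothetical miniature versions of chardf and dirdf
chars = pd.DataFrame({
    "index": [0, 1],
    "movie_title": ["\rPinocchio", "Dumbo"],
})
dirs = pd.DataFrame({
    "index": [0, 1],
    "name": ["Pinocchio", "Dumbo"],
    "director": ["Ben Sharpsteen", "Ben Sharpsteen"],
})

# Like idxdf: pair rows by the shared "index" column
by_index = pd.merge(chars, dirs, on="index")

# Like chddf: pair rows only where the title strings are identical
by_title = pd.merge(chars, dirs, left_on="movie_title", right_on="name")

print(len(by_index), len(by_title))
```

Here the merge on "index" keeps both rows, while the merge on titles keeps only Dumbo, because "\rPinocchio" is not the same string as "Pinocchio".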

### 5b.	Based on your experience of this data, suggest two ways in which chgrdf could be tidied, to enhance analysis.

In [21]:
chgrdf.head(10)

Unnamed: 0,index_x,movie_title_x,genre,MPAA_rating,total_gross,inflation_adjusted_gross,year,index_y,movie_title_y,hero,villian,song,name,director
0,0,Snow White and the Seven Dwarfs,Musical,G,184925485,"$5,228,953,251",1937,0,\rSnow White and the Seven Dwarfs,Snow White,Evil Queen,Some Day My Prince Will Come,Snow White and the Seven Dwarfs,David Hand
1,1,Pinocchio,Adventure,G,84300000,"$2,188,229,052",1940,1,\rPinocchio,Pinocchio,Stromboli,When You Wish upon a Star,Pinocchio,Ben Sharpsteen
2,2,Fantasia,Musical,G,83320000,"$2,187,090,808",1940,2,\rFantasia,,Chernabog,,Fantasia,full credits
3,4,Cinderella,Drama,G,85000000,"$920,608,730",1950,11,Cinderella,Cinderella,Lady Tremaine,Bibbidi-Bobbidi-Boo,Cinderella,Wilfred Jackson
4,6,Lady and the Tramp,Drama,G,93600000,"$1,236,035,515",1955,14,\rLady and the Tramp,Lady and Tramp,Si and Am,Bella Notte,Lady and the Tramp,Hamilton Luske
5,7,Sleeping Beauty,Drama,,9464608,"$21,505,832",1959,15,Sleeping Beauty,Aurora,Maleficent,Once Upon a Dream,Sleeping Beauty,Clyde Geronimi
6,8,101 Dalmatians,Comedy,G,153000000,"$1,362,870,985",1961,16,\rOne Hundred and One Dalmatians,Pongo,Cruella de Vil,Cruella De Vil,101 Dalmatians,Wolfgang Reitherman
7,12,The Sword in the Stone,Adventure,,22182353,"$153,870,834",1963,17,\rThe Sword in the Stone,Arthur,Madam Mim,Higitus Figitus\r\n,The Sword in the Stone,Wolfgang Reitherman
8,13,The Jungle Book,Musical,Not Rated,141843000,"$789,612,346",1967,18,The Jungle Book,Mowgli,Kaa and Shere Khan,The Bare Necessities\r\n,The Jungle Book,Wolfgang Reitherman
9,15,The Aristocats,Musical,G,55675257,"$255,161,499",1970,19,The Aristocats,Thomas and Duchess,Edgar Balthazar,Ev'rybody Wants to Be a Cat,The Aristocats,Wolfgang Reitherman


*Run the code above for visualisation.*

1. chgrdf doesn't need to keep index_x or index_y, since pandas already gives each row its own index.
2. MPAA_rating adds no value in this context and can be removed.
3. inflation_adjusted_gross should have its $ and , removed and be converted to an integer, as was done for total_gross.
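
A minimal sketch of the first and third suggestions, run on a made-up two-row frame rather than chgrdf itself (the frame `df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for chgrdf with a subset of its columns
df = pd.DataFrame({
    "index_x": [0, 1],
    "index_y": [0, 1],
    "movie_title_x": ["Pinocchio", "Fantasia"],
    "inflation_adjusted_gross": ["$2,188,229,052", "$2,187,090,808"],
})

# Drop the two leftover index columns
tidy = df.drop(columns=["index_x", "index_y"])

# Strip "$" and "," and store the amounts as integers
tidy["inflation_adjusted_gross"] = (
    tidy["inflation_adjusted_gross"]
    .str.replace(r"[$,]", "", regex=True)
    .astype("int64")
)

print(tidy.dtypes)
```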

# BEFORE YOU SUBMIT:
Restart and run all, to make sure you have no errors.