# Lab Test on manipulating data with Pandas

## Rules:

•	Make sure you are comfortable before the lab test starts – students leaving the lab during the test will not be allowed back in.

•	You must sit your assigned lab test – you will not be able to access any other lab test.

•	You must use the lab machines. You are not allowed to use any other devices, including your own laptop or phone. You should have no other device on the desktop during the test.

•	You may use paper. 

•	You may access Brightspace notes during the lab test, from the lab machine.

•	You must not communicate, verbally or electronically, with anyone other than the lab supervisor during the test.

•	Turn up at the start of the lab, log in to one of the desktops and start up Jupyter Notebook.

•	When you have completed your test, check that you have answered all parts and submit. Log off and leave the lab.

•	Do not speak to anyone, either in the lab or in the corridor outside, until the lab test is over.


In [1]:
import pandas as pd
import os
from os.path import join
from os import getcwd

In [2]:
datadir = join('Z:', os.sep, 'Disney')  # 'Z:\Disney' on Windows; join('Z:', 'Disney') alone gives the drive-relative 'Z:Disney'

## 1.  Read data into dataframes from the directory datadir

### 1a Read disney-characters.csv into the dataframe chardf.

In [3]:
chardf = pd.read_csv(join(datadir, "disney-characters.csv"), encoding='UTF-8')  # read from datadir, as the task asks

### 1b Read disney-director.csv into the dataframe dirdf

In [4]:
dirdf = pd.read_csv(join(datadir, "disney-director.csv"), encoding='UTF-8')  # read from datadir, as the task asks

### 1c Read disney_movies_total_gross.csv into the dataframe mgrdf

In [5]:
mgrdf = pd.read_csv(join(datadir, "disney_movies_total_gross.csv"), encoding='UTF-8')  # read from datadir, as the task asks

## 2.	Check and tidy the data

### 2a. Find the number of unique values for each of the columns in each of the dataframes.

In [6]:
# A notebook cell displays only its last expression, so only the mgrdf counts appear below.
chardf.nunique()
dirdf.nunique()
mgrdf.nunique()

index                       579
movie_title                 573
release_date                553
genre                        12
MPAA_rating                   5
total_gross                 576
inflation_adjusted_gross    576
dtype: int64
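
To see the counts for all three frames from a single cell, each result can be printed explicitly. The following is a minimal sketch on made-up toy frames (the frame contents and the `frames` and `unique_counts` names are assumptions for illustration, not the lab data):

```python
import pandas as pd

# Toy frames standing in for chardf and dirdf (values are made up).
frames = {
    "chardf": pd.DataFrame({"movie_title": ["A", "B", "B"], "hero": ["x", "y", "y"]}),
    "dirdf": pd.DataFrame({"name": ["A", "B"], "director": ["d1", "d1"]}),
}

# nunique() returns one count per column; collect it for every frame.
unique_counts = {label: df.nunique() for label, df in frames.items()}

# print() forces output for each frame, not just the last expression.
for label, counts in unique_counts.items():
    print(label)
    print(counts, "\n")
```

The same loop should work with the real dataframes by replacing the dictionary values.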

### 2b. Remove redundant '\n' from movie_title in chardf

In [7]:
chardf['movie_title'] = chardf['movie_title'].str.replace("\n", "", regex=False)  # vectorised; also tolerates missing titles
print(chardf.dtypes)

index            int64
movie_title     object
release_date    object
hero            object
villian         object
song            object
dtype: object


### 2c.	In both chardf and mgrdf, extract the 4-digit year from release_date and save it as a new integer column 'year'. Then drop the release_date column.

In [8]:
chardf['year'] = pd.to_datetime(chardf['release_date']).dt.year
chardf = chardf.drop('release_date', axis=1)

mgrdf['year'] = pd.to_datetime(mgrdf['release_date']).dt.year
mgrdf = mgrdf.drop('release_date', axis=1)
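
If some release dates cannot be parsed, the conversion above raises an error. A hedged sketch of one way to handle that, assuming dates formatted like `Dec 21, 1937` (the sample strings and the explicit `format` are assumptions for illustration):

```python
import pandas as pd

# Sample date strings; the last one is deliberately unparseable.
dates = pd.Series(["Dec 21, 1937", "Feb 7, 1940", "unknown"])

# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%b %d, %Y", errors="coerce")

# The nullable Int64 dtype keeps the years as integers even with missing values.
years = parsed.dt.year.astype("Int64")
print(years)
```

Rows with a missing year can then be inspected with `years.isna()` before deciding whether to drop or repair them.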

### 2d. Turn mgrdf.total_gross into an integer (first remove commas and $)

In [9]:
# Strip the thousands separators and the currency symbol, then convert to integer.
mgrdf['total_gross'] = mgrdf['total_gross'].str.replace(",", "", regex=False)
mgrdf['total_gross'] = mgrdf['total_gross'].str.replace("$", "", regex=False)
mgrdf['total_gross'] = mgrdf['total_gross'].astype(int)
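
The two replacements can also be combined into one regular expression. This is a small sketch on made-up values; the same pattern would presumably apply to inflation_adjusted_gross, which is still stored as text:

```python
import pandas as pd

# Made-up currency strings in the same style as total_gross.
gross = pd.Series(["$184,925,485", "$84,300,000", "$0"])

# Inside a character class, $ and , are matched literally.
cleaned = gross.str.replace(r"[$,]", "", regex=True).astype("int64")
print(cleaned)
```

Requesting `int64` explicitly avoids overflow on platforms where the default integer is 32-bit, which matters for values above roughly 2.1 billion.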

### 2e. Ensure that movie_title and name in all tables are of type string.

In [10]:
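# Comparing a column dtype with the built-in str is never equal for an object column,
# so the conversion is applied unconditionally; the dtype is still reported as object below.
for df in [mgrdf, chardf]:
    df['movie_title'] = df['movie_title'].astype(str)
    print(df.dtypes, '\n')

dirdf['name'] = dirdf['name'].astype(str)
print(dirdf.dtypes)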

index                        int64
movie_title                 object
genre                       object
MPAA_rating                 object
total_gross                  int64
inflation_adjusted_gross    object
year                         int32
dtype: object 

index           int64
movie_title    object
hero           object
villian        object
song           object
year            int32
dtype: object 

index        int64
name        object
director    object
dtype: object
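
Because `astype(str)` still reports `object` in the output above, the dtype alone does not confirm that a column holds strings. A hedged alternative is the dedicated pandas `string` dtype, sketched here on a made-up column:

```python
import pandas as pd

# Made-up titles with one missing value.
titles = pd.Series(["Bambi", None, "Dumbo"])

# The dedicated string dtype keeps the missing value as <NA>.
as_string = titles.astype("string")

# is_string_dtype gives an explicit check instead of comparing dtypes by hand.
print(as_string.dtype, pd.api.types.is_string_dtype(as_string))
```

Whether the lab expects `object` or `string` is not stated in the task, so this is only one possible reading of "type string".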


## 3. Merge the data

### 3a Merge chardf and dirdf on the index column, keeping only index values that match, and save the result in idxdf.

In [11]:
idxdf = pd.merge(chardf, dirdf, on='index')
idxdf.head(5)

Unnamed: 0,index,movie_title,hero,villian,song,year,name,director
0,0,Snow White and the Seven Dwarfs,Snow White,Evil Queen,Some Day My Prince Will Come,1937,Snow White and the Seven Dwarfs,David Hand
1,1,Pinocchio,Pinocchio,Stromboli,When You Wish upon a Star,1940,Pinocchio,Ben Sharpsteen
2,2,Fantasia,,Chernabog,,1940,Fantasia,full credits
3,3,Dumbo,Dumbo,Ringmaster,Baby Mine,1941,Dumbo,Ben Sharpsteen
4,4,Bambi,Bambi,Hunter,Love Is a Song,1942,Bambi,David Hand


### 3b Merge chardf and dirdf, using movie_title from chardf and name from dirdf, keeping only rows that match, and save the result in chddf.

In [12]:
chddf = pd.merge(chardf, dirdf, left_on='movie_title', right_on='name')  # column labels instead of mixing a label with a Series
chddf.head(5)

Unnamed: 0,index_x,movie_title,hero,villian,song,year,index_y,name,director
0,0,Snow White and the Seven Dwarfs,Snow White,Evil Queen,Some Day My Prince Will Come,1937,0,Snow White and the Seven Dwarfs,David Hand
1,1,Pinocchio,Pinocchio,Stromboli,When You Wish upon a Star,1940,1,Pinocchio,Ben Sharpsteen
2,2,Fantasia,,Chernabog,,1940,2,Fantasia,full credits
3,3,Dumbo,Dumbo,Ringmaster,Baby Mine,1941,3,Dumbo,Ben Sharpsteen
4,4,Bambi,Bambi,Hunter,Love Is a Song,1942,4,Bambi,David Hand


### 3c.	Merge mgrdf with idxdf using the movie_title and year from mgrdf, and the name and year from idxdf, keeping financial information for all movies, and save the result in chgrdf.

In [13]:
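# The default how='inner' keeps only matching rows; how='left' would keep the financial data for every movie in mgrdf.
# Passing Series as keys is what creates the key_0/key_1 columns and the year_x/year_y pair shown below.
chgrdf = pd.merge(mgrdf, idxdf, left_on=(mgrdf.movie_title, mgrdf.year), right_on=(idxdf.name, idxdf.year))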
chgrdf.head(5)

Unnamed: 0,key_0,key_1,index_x,movie_title_x,genre,MPAA_rating,total_gross,inflation_adjusted_gross,year_x,index_y,movie_title_y,hero,villian,song,year_y,name,director
0,Snow White and the Seven Dwarfs,1937,0,Snow White and the Seven Dwarfs,Musical,G,184925485,"$5,228,953,251",1937,0,Snow White and the Seven Dwarfs,Snow White,Evil Queen,Some Day My Prince Will Come,1937,Snow White and the Seven Dwarfs,David Hand
1,Pinocchio,1940,1,Pinocchio,Adventure,G,84300000,"$2,188,229,052",1940,1,Pinocchio,Pinocchio,Stromboli,When You Wish upon a Star,1940,Pinocchio,Ben Sharpsteen
2,Fantasia,1940,2,Fantasia,Musical,G,83320000,"$2,187,090,808",1940,2,Fantasia,,Chernabog,,1940,Fantasia,full credits
3,Cinderella,1950,4,Cinderella,Drama,G,85000000,"$920,608,730",1950,11,Cinderella,Cinderella,Lady Tremaine,Bibbidi-Bobbidi-Boo,1950,Cinderella,Wilfred Jackson
4,Lady and the Tramp,1955,6,Lady and the Tramp,Drama,G,93600000,"$1,236,035,515",1955,14,Lady and the Tramp,Lady and Tramp,Si and Am,Bella Notte,1955,Lady and the Tramp,Hamilton Luske
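
The task asks to keep financial information for all movies, which corresponds to a left merge from the financial table. A minimal sketch on made-up toy frames (the `gross` and `credits` frames and their values are assumptions for illustration), using column labels as keys so that a single `year` column survives:

```python
import pandas as pd

# Toy stand-ins for mgrdf (gross) and idxdf (credits); values are made up.
gross = pd.DataFrame({
    "movie_title": ["Film A", "Film B", "Film C"],
    "year": [1990, 1995, 2000],
    "total_gross": [100, 200, 300],
})
credits = pd.DataFrame({
    "name": ["Film A", "Film C"],
    "year": [1990, 2000],
    "director": ["Dir 1", "Dir 2"],
})

# how="left" keeps every row of gross; unmatched rows get NaN for credits columns.
merged = pd.merge(
    gross, credits,
    how="left",
    left_on=["movie_title", "year"],
    right_on=["name", "year"],
    indicator=True,  # adds a _merge column showing where each row came from
)
print(merged)
```

Filtering on `merged["_merge"] == "left_only"` lists the movies that have financial data but no matching credits row.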


## 4.	Queries: 

### 4a.	Using chgrdf, show each genre and the number of movies from that genre.

In [26]:
chgrdf['genre'].value_counts()  # not displayed: a cell shows only its last expression
chgrdf.columns

Index(['key_0', 'key_1', 'index_x', 'movie_title_x', 'genre', 'MPAA_rating',
       'total_gross', 'inflation_adjusted_gross', 'year_x', 'index_y',
       'movie_title_y', 'hero', 'villian', 'song', 'year_y', 'name',
       'director'],
      dtype='object')

### 4b.	Using chgrdf, for each row where movie_title_x is not equal to movie_title_y, display movie_title_x, movie_title_y, year, genre, director and total_gross.

In [25]:
# chgrdf has year_x and year_y (see the columns listed in 4a) rather than year, so year_x is selected.
chgrdf.loc[chgrdf['movie_title_x'] != chgrdf['movie_title_y'], ['movie_title_x', 'movie_title_y', 'year_x', 'genre', 'director', 'total_gross']]

### 4c.	Using chgrdf, for each row where movie_title_x or movie_title_y contains the string 'Dalmatian', display movie_title_x, movie_title_y, year, genre, director and total_gross.
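
One possible approach is a boolean mask built with `str.contains` on both title columns. The sketch below uses a made-up toy frame (the titles and the `films` name are assumptions for illustration) and `year_x`, since that is the year column present in chgrdf:

```python
import pandas as pd

# Toy frame with the same title columns as chgrdf; values are made up.
films = pd.DataFrame({
    "movie_title_x": ["Dalmatian Story", "Other Film", "Third Film"],
    "movie_title_y": ["A Dalmatian Tale", "Other Film", None],
    "year_x": [1996, 1942, 2000],
})

# na=False treats missing titles as non-matches; regex=False does a plain substring test.
mask = (
    films["movie_title_x"].str.contains("Dalmatian", na=False, regex=False)
    | films["movie_title_y"].str.contains("Dalmatian", na=False, regex=False)
)
hits = films.loc[mask, ["movie_title_x", "movie_title_y", "year_x"]]
print(hits)
```

On the real data the column list would be extended with genre, director and total_gross.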

## 5.	What did you learn? Use Markdown or comments to enter your answer.
### 5a.	Explain the difference between idxdf and chddf.

### 5b.	Based on your experience of this data, suggest two ways in which chgrdf could be tidied, to enhance analysis.

# BEFORE YOU SUBMIT:
Restart and run all, to make sure you have no errors.