# Data Exploration with Pandas
<table><tr><td>
<img src="https://resizing.flixster.com/Grjhpv0wcwgi-uhfaC3QM8KFglY=/ems.cHJkLWVtcy1hc3NldHMvbW92aWVzLzdmOWE4MWFiLWVlOWMtNDA4Mi05OTA0LTRiNjMxNTEwMzk1MC5qcGc=" height=300><a href="https://resizing.flixster.com/Grjhpv0wcwgi-uhfaC3QM8KFglY=/ems.cHJkLWVtcy1hc3NldHMvbW92aWVzLzdmOWE4MWFiLWVlOWMtNDA4Mi05OTA0LTRiNjMxNTEwMzk1MC5qcGc=">source</a></td><td><img src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRQCHfe9VV3K3Efxv5PYQ_6NYpB20WkKS1zW21UEUmhW1lalECnbwTH3nwQL8XprEMTUCtPeA" height=300><a href="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRQCHfe9VV3K3Efxv5PYQ_6NYpB20WkKS1zW21UEUmhW1lalECnbwTH3nwQL8XprEMTUCtPeA">source</a></td></tr></table>

In [2]:
# importing the package(s) we want to use
import pandas as pd

### Let's explore the movies data set more! ###
We'll use the `pd.read_csv()` function to read the CSV file into a DataFrame.

In [4]:
csvFile = 'https://raw.githubusercontent.com/csbfx/advpy122-data/master/top_movies_2020.csv'

movies = pd.read_csv(csvFile)
movies.head()

Unnamed: 0,Title,Gross,Gross (Adjusted),Year
0,Gone with the Wind,200852579,1895421694,1939
1,Star Wars: Episode IV - A New Hope,460998507,1668979715,1977
2,The Sound of Music,159287539,1335086324,1965
3,E.T. the Extra-Terrestrial,435110554,1329174791,1982
4,Titanic,659363944,1270101626,1997
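
`pd.read_csv()` also accepts a file-like object, which is handy for trying out options without a network connection. The sketch below is only an illustration: `csv_text` is a hypothetical stand-in holding three rows copied from the output above, not the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the remote file: three rows copied from the output above.
csv_text = """Title,Gross,Gross (Adjusted),Year
Gone with the Wind,200852579,1895421694,1939
Star Wars: Episode IV - A New Hope,460998507,1668979715,1977
The Sound of Music,159287539,1335086324,1965
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # column types inferred by the parser

# usecols limits which columns are loaded; nrows limits how many rows are read.
titles_only = pd.read_csv(io.StringIO(csv_text), usecols=["Title", "Year"], nrows=2)
print(titles_only)
```

Loading only the columns and rows you need can make a first look at a large file noticeably faster.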


### Initial data exploration

We can examine the contents of the resulting DataFrame using the `head()` and `tail()` methods:

In [41]:
### Take a look at the first 3 rows of the file
movies.head(3)

Unnamed: 0,Title,Gross,Gross (Adjusted),Year
0,Gone with the Wind,200852579,1895421694,1939
1,Star Wars: Episode IV - A New Hope,460998507,1668979715,1977
2,The Sound of Music,159287539,1335086324,1965


In [43]:
### How about the last 5 rows of the file?
movies.tail()

Unnamed: 0,Title,Gross,Gross (Adjusted),Year
195,Patton,61749765,373287682,1970
196,Fatal Attraction,156645693,371808159,1987
197,Iron Man 2,312433331,371691971,2010
198,Zootopia,341268248,371109157,2016
199,Liar Liar,181410615,370330510,1997
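
As a small aside, `head()` and `tail()` both default to 5 rows and also accept a negative `n`. The sketch below uses a tiny stand-in frame (`df`, built from rows shown above) rather than the full data set.

```python
import pandas as pd

# Small stand-in frame built from rows shown above (not the full 200-row file).
df = pd.DataFrame({
    "Title": ["Gone with the Wind", "Star Wars: Episode IV - A New Hope",
              "The Sound of Music", "E.T. the Extra-Terrestrial", "Titanic"],
    "Year": [1939, 1977, 1965, 1982, 1997],
})

first_two = df.head(2)        # rows at positions 0 and 1
last_two = df.tail(2)         # rows at positions 3 and 4
all_but_last = df.head(-1)    # negative n: every row except the last one

print(first_two)
print(last_two)
print(len(all_but_last))
```

Note that `tail()` keeps the original index labels, which is why the output above starts at 195 rather than 0.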


Use the `info()` method to get a quick description of the DataFrame: its index range, column names, non-null counts, dtypes, and approximate memory usage.

In [45]:
### Get a quick summary of the data using the info() method
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Title             200 non-null    object
 1   Gross             200 non-null    int64 
 2   Gross (Adjusted)  200 non-null    int64 
 3   Year              200 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 6.4+ KB
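
The pieces that `info()` reports can also be retrieved individually. The following is a minimal sketch on a hypothetical three-row stand-in frame (`df`). The `+` in `6.4+ KB` typically signals that the figure is a lower bound, because the contents of `object` columns are not measured unless a deep inspection is requested.

```python
import pandas as pd

# Stand-in frame with the same four columns as the movies data (three rows only).
df = pd.DataFrame({
    "Title": ["Gone with the Wind", "Star Wars: Episode IV - A New Hope", "The Sound of Music"],
    "Gross": [200852579, 460998507, 159287539],
    "Gross (Adjusted)": [1895421694, 1668979715, 1335086324],
    "Year": [1939, 1977, 1965],
})

print(df.shape)     # (number of rows, number of columns)
print(df.dtypes)    # one dtype per column

missing_per_column = df.isna().sum()   # count of missing values in each column
print(missing_per_column)

# Shallow memory usage ignores the strings stored in the Title column;
# deep=True inspects them as well.
shallow_bytes = df.memory_usage().sum()
deep_bytes = df.memory_usage(deep=True).sum()
print(shallow_bytes, deep_bytes)

# info() can report the deep figure directly.
df.info(memory_usage="deep")
```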


# Selecting and filtering elements

Using `iloc` (position-based) and `loc` (label-based) to extract specific rows and columns

In [53]:
### Get the first 10 values (movies) of the second column ('Gross') as a named Series
new_series = movies.iloc[:10,1]
new_series

0    200852579
1    460998507
2    159287539
3    435110554
4    659363944
5     65500000
6    260758300
7    111721910
8    232906145
9    184925486
Name: Gross, dtype: int64

In [63]:
### Get the titles of the first 10 movies as a Series and convert it to the 'string' dtype
movie_names = movies.iloc[:10,0].astype('string')
movie_names

0                    Gone with the Wind
1    Star Wars: Episode IV - A New Hope
2                    The Sound of Music
3            E.T. the Extra-Terrestrial
4                               Titanic
5                  The Ten Commandments
6                                  Jaws
7                        Doctor Zhivago
8                          The Exorcist
9       Snow White and the Seven Dwarfs
Name: Title, dtype: string

In [87]:
### Create a smaller dataframe with last 20 elements and all columns except 'Year'. Give the columns new custom names (your choice).
smaller_dataframe = movies.drop('Year', axis=1).tail(20)
smaller_dataframe.columns = ['a', 'b', 'c']
smaller_dataframe.tail()

Unnamed: 0,a,b,c
195,Patton,61749765,373287682
196,Fatal Attraction,156645693,371808159
197,Iron Man 2,312433331,371691971
198,Zootopia,341268248,371109157
199,Liar Liar,181410615,370330510
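
The cells above only use `iloc`. As a rough comparison, the sketch below shows the same kind of selection with `loc` on a small stand-in frame (`df`, five rows taken from the output above). The main differences are that `iloc` works with integer positions and excludes the end of a slice, while `loc` works with index labels and includes it.

```python
import pandas as pd

# Stand-in frame: five rows copied from the output above.
df = pd.DataFrame({
    "Title": ["Gone with the Wind", "Star Wars: Episode IV - A New Hope",
              "The Sound of Music", "E.T. the Extra-Terrestrial", "Titanic"],
    "Gross": [200852579, 460998507, 159287539, 435110554, 659363944],
    "Year": [1939, 1977, 1965, 1982, 1997],
})

# Slicing: iloc excludes the end position, loc includes the end label.
pos_slice = df.iloc[0:2, 0]            # positions 0 and 1 of the first column
label_slice = df.loc[0:2, "Title"]     # labels 0, 1 and 2 of the Title column

# After tail(), positions restart at 0 but the labels 2, 3, 4 are kept.
last_three = df.tail(3)
by_position = last_three.iloc[0, 0]    # first row, first column, by position
by_label = last_three.loc[2, "Title"]  # the same cell, addressed by label

print(len(pos_slice), len(label_slice))
print(by_position, "|", by_label)
```

With the default integer index the two accessors often look interchangeable, so the difference only becomes visible after slicing, filtering, or setting a custom index.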


Subsetting the dataframe based on conditions

In [85]:
### Create a smaller dataframe with movies made in the 2000s (2000 through 2009)
new_smaller_dataframe = movies[(movies['Year'] >= 2000) & (movies['Year'] <= 2009)]
new_smaller_dataframe.head()

Unnamed: 0,Title,Gross,Gross (Adjusted),Year
14,Avatar,760507625,911790952,2009
32,The Dark Knight,535234033,698121220,2008
37,Shrek 2,441226247,665746933,2004
38,Spider-Man,407022860,661768431,2002
52,Pirates of the Caribbean: Dead Man's Chest,423315812,605568108,2006


In [101]:
### How many movies have an adjusted gross ('Gross (Adjusted)') over 1,500,000,000?
high_gross_movies = movies[movies["Gross (Adjusted)"] > 1500000000]
print(f'{len(high_gross_movies)} movies gross over 1,500,000,000')

2 movies gross over 1,500,000,000
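
Each condition above produces a boolean Series (a mask) that selects the rows where it is `True`. The sketch below shows a few equivalent ways to build and use such masks on a hypothetical six-row stand-in frame (`df`); the rows are copied from outputs in this notebook.

```python
import pandas as pd

# Stand-in frame: six rows copied from outputs shown in this notebook.
df = pd.DataFrame({
    "Title": ["Gone with the Wind", "Star Wars: Episode IV - A New Hope", "Titanic",
              "Avatar", "The Dark Knight", "Iron Man 2"],
    "Gross (Adjusted)": [1895421694, 1668979715, 1270101626,
                         911790952, 698121220, 371691971],
    "Year": [1939, 1977, 1997, 2009, 2008, 2010],
})

# between() is inclusive on both ends by default, so this matches
# (Year >= 2000) & (Year <= 2009).
in_2000s = df["Year"].between(2000, 2009)
movies_2000s = df[in_2000s]

# query() expresses the same filter as a string.
movies_2000s_query = df.query("Year >= 2000 and Year <= 2009")

# Summing a boolean mask counts the True values, so no intermediate frame is needed.
high_gross_count = int((df["Gross (Adjusted)"] > 1_500_000_000).sum())

print(movies_2000s["Title"].tolist())
print(high_gross_count)
```

Remember to wrap each comparison in parentheses when combining masks with `&` or `|`, since those operators bind more tightly than comparisons.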


Customizing the dataframe

In [105]:
### Use DataFrame.columns to change the column names to 'Movie', 'Gross', 'Gross_adj', and 'Year'
movies.columns = ["Movie", "Gross", "Gross_adj", "Year"]
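
Assigning to `DataFrame.columns` replaces every name at once and relies on the order of the list. One possible alternative, sketched here on a hypothetical two-row stand-in frame (`df`), is `rename()`, which maps old names to new ones and leaves unlisted columns alone.

```python
import pandas as pd

# Stand-in frame with the original column names (two rows only).
df = pd.DataFrame({
    "Title": ["Gone with the Wind", "Titanic"],
    "Gross": [200852579, 659363944],
    "Gross (Adjusted)": [1895421694, 1270101626],
    "Year": [1939, 1997],
})

# rename() returns a new DataFrame; only the columns named in the mapping change.
renamed = df.rename(columns={"Title": "Movie", "Gross (Adjusted)": "Gross_adj"})

print(list(df.columns))       # original names are untouched
print(list(renamed.columns))  # new names
```

Because the mapping is explicit, this approach does not break if the column order changes.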

In [111]:
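### Set the 'Movie' column as the index
### Note: set_index() returns a new DataFrame; `movies` itself keeps its integer index unless the result is assigned back
movies.set_index("Movie")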

Movie,Gross,Gross_adj,Year
Gone with the Wind,200852579,1895421694,1939
Star Wars: Episode IV - A New Hope,460998507,1668979715,1977
The Sound of Music,159287539,1335086324,1965
E.T. the Extra-Terrestrial,435110554,1329174791,1982
Titanic,659363944,1270101626,1997
...,...,...,...
Patton,61749765,373287682,1970
Fatal Attraction,156645693,371808159,1987
Iron Man 2,312433331,371691971,2010
Zootopia,341268248,371109157,2016
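
Since `set_index()` does not modify the frame it is called on, the result has to be stored if you want to use the new index later. A minimal sketch, again on a hypothetical stand-in frame (`df`) with the renamed columns:

```python
import pandas as pd

# Stand-in frame with the renamed columns (three rows only).
df = pd.DataFrame({
    "Movie": ["Gone with the Wind", "The Sound of Music", "Titanic"],
    "Gross": [200852579, 159287539, 659363944],
    "Gross_adj": [1895421694, 1335086324, 1270101626],
    "Year": [1939, 1965, 1997],
})

# Keep the result in a new variable; df itself still has 'Movie' as a column.
indexed = df.set_index("Movie")

# With titles as the index, loc can look up a movie by name.
titanic_year = indexed.loc["Titanic", "Year"]
print(titanic_year)

# reset_index() moves the index back into an ordinary column.
restored = indexed.reset_index()
print(list(restored.columns))
```

Keeping `movies` unchanged, as the notebook does, means later cells can still select columns such as `movies["Gross"]` in the usual way.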


Getting some statistics about the data

In [113]:
### Get some statistical information about the 'Gross' column
movies["Gross"].describe()

count    2.000000e+02
mean     2.564920e+08
std      1.705675e+08
min      9.183673e+06
25%      1.169264e+08
50%      2.341963e+08
75%      3.633033e+08
max      9.366622e+08
Name: Gross, dtype: float64
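
`describe()` bundles several summary statistics, and the result is reported as `float64` even though the column holds integers. If only some of the statistics are needed, they can be requested individually. The sketch below uses the five `Gross` values shown at the top of the notebook as a stand-in Series (`gross`).

```python
import pandas as pd

# Stand-in Series: the first five Gross values from the notebook.
gross = pd.Series([200852579, 460998507, 159287539, 435110554, 659363944], name="Gross")

summary = gross.describe()                  # count, mean, std, min, quartiles, max
print(summary)

# agg() computes just the statistics that are listed.
selected = gross.agg(["mean", "median", "std"])
print(selected)

# The percentiles argument changes which quantiles describe() reports.
custom = gross.describe(percentiles=[0.1, 0.9])
print(custom)
```

The `50%` row of `describe()` is the median, so it should agree with `gross.median()`.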

In [125]:
### What is the average adjusted gross ('Gross_adj', formerly 'Gross (Adjusted)') for movies from the 1990s?
movies_1990s = movies[(movies["Year"] >= 1990) & (movies["Year"] <= 1999)]
movies_1990s["Gross_adj"].mean()

533060043.13793105
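
Filtering one decade at a time works, but the same average can be computed for every decade in one step with `groupby()`. This is a sketch under the assumption that a decade is defined as `Year // 10 * 10`; the stand-in frame (`df`) holds five rows copied from outputs in this notebook.

```python
import pandas as pd

# Stand-in frame: five rows copied from outputs shown in this notebook.
df = pd.DataFrame({
    "Movie": ["Titanic", "Liar Liar", "Fatal Attraction",
              "E.T. the Extra-Terrestrial", "Patton"],
    "Gross_adj": [1270101626, 370330510, 371808159, 1329174791, 373287682],
    "Year": [1997, 1997, 1987, 1982, 1970],
})

# Integer division drops the last digit of the year: 1997 -> 1990.
decade = df["Year"] // 10 * 10

# Group rows by decade and average the adjusted gross within each group.
by_decade = df.groupby(decade)["Gross_adj"].mean()
print(by_decade)
```

On the full data set, the entry for 1990 should match the value computed in the cell above.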

In [141]:
### What is the standard deviation of 'Gross (Adjusted)'?
print(f' the std dev of adjusted gross for all movies is {round(movies["Gross_adj"].std(),2)}')
print(f' the std dev of adjusted gross for movies from the 90s is {round(movies_1990s["Gross_adj"].std(), 2)}')

 the std dev of adjusted gross for all movies is 227797683.45
 the std dev of adjusted gross for movies from the 90s is 203516635.96
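
One detail worth knowing when reading these numbers: pandas computes the sample standard deviation by default (`ddof=1`), while NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below illustrates the difference on five `Gross_adj` values taken from the top of the notebook; the Series `gross_adj` is a stand-in, not the full column.

```python
import math
import numpy as np
import pandas as pd

# Stand-in Series: the first five adjusted-gross values from the notebook.
gross_adj = pd.Series(
    [1895421694, 1668979715, 1335086324, 1329174791, 1270101626],
    name="Gross_adj",
)

sample_std = gross_adj.std()               # pandas default: ddof=1 (divide by n - 1)
population_std = gross_adj.std(ddof=0)     # divide by n instead
numpy_std = np.std(gross_adj.to_numpy())   # NumPy default: ddof=0

n = len(gross_adj)
print(f"sample std:     {sample_std:.2f}")
print(f"population std: {population_std:.2f}")
print(f"numpy std:      {numpy_std:.2f}")

# The two versions differ by a factor of sqrt(n / (n - 1)).
print(math.isclose(sample_std, population_std * math.sqrt(n / (n - 1))))
```

For 200 rows the two values are close, but the gap grows as the sample gets smaller, such as when looking at a single decade.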
