# Exploratory Data Analysis with pandas Studio

## Data Analysis of IMDB movies data

#### Blocks have been created for your code

#### Click on the following link and download the dataset from [Kaggle](https://www.kaggle.com/PromptCloudHQ/imdb-data).

### 1. Read data

In [2]:
# import pandas
import pandas as pd
# create your data variable using .read_csv()
df = pd.read_csv('IMDB-Movie-Data.csv')

# create your data_indexed variable here
index_variable = df.loc[:,'Rank']

print(df)


     Rank                    Title                     Genre  \
0       1  Guardians of the Galaxy   Action,Adventure,Sci-Fi   
1       2               Prometheus  Adventure,Mystery,Sci-Fi   
2       3                    Split           Horror,Thriller   
3       4                     Sing   Animation,Comedy,Family   
4       5            Suicide Squad  Action,Adventure,Fantasy   
..    ...                      ...                       ...   
995   996     Secret in Their Eyes       Crime,Drama,Mystery   
996   997          Hostel: Part II                    Horror   
997   998   Step Up 2: The Streets       Drama,Music,Romance   
998   999             Search Party          Adventure,Comedy   
999  1000               Nine Lives     Comedy,Family,Fantasy   

                                           Description              Director  \
0    A group of intergalactic criminals are forced ...            James Gunn   
1    Following clues to the origin of mankind, a te...          Ridley 

In [3]:
# If you have trouble reading your file, check its path.
# Right-click your dataset, select "Copy Path" from the pop-up menu, then paste the path into your read_csv() call.

In [4]:
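# Read data with a specified explicit index
# Note: passing an existing DataFrame together with index= reindexes it by label instead of
# relabeling its rows, so label 1 picks up the row at position 1 (Prometheus) and label 1000
# matches nothing (all NaN). df.set_index('Rank') would use Rank as the index without the shift.
df_index = pd.DataFrame(df, index=df.loc[:,'Rank'])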

print(df_index)

        Rank                   Title                     Genre  \
Rank                                                             
1        2.0              Prometheus  Adventure,Mystery,Sci-Fi   
2        3.0                   Split           Horror,Thriller   
3        4.0                    Sing   Animation,Comedy,Family   
4        5.0           Suicide Squad  Action,Adventure,Fantasy   
5        6.0          The Great Wall  Action,Adventure,Fantasy   
...      ...                     ...                       ...   
996    997.0         Hostel: Part II                    Horror   
997    998.0  Step Up 2: The Streets       Drama,Music,Romance   
998    999.0            Search Party          Adventure,Comedy   
999   1000.0              Nine Lives     Comedy,Family,Fantasy   
1000     NaN                     NaN                       NaN   

                                            Description              Director  \
Rank                                                        
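The shifted rows and the all-NaN row labeled 1000 in the output above come from how the `DataFrame` constructor handles `index=` when it receives an existing DataFrame: it looks up each requested label in the current index rather than relabeling the rows. Here is a minimal sketch of the difference, using a five-row stand-in built from the titles shown earlier. The names `df_demo`, `reindexed`, and `df_by_rank` are illustrative and not part of the studio.

```python
import pandas as pd

# Five-row stand-in for the start of IMDB-Movie-Data.csv (illustrative only)
df_demo = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Title": ["Guardians of the Galaxy", "Prometheus", "Split", "Sing", "Suicide Squad"],
})

# Constructor with index=: each label 1..5 is looked up in the old 0..4 index,
# so label 1 gets the row that sat at position 1 and label 5 finds nothing.
reindexed = pd.DataFrame(df_demo, index=df_demo["Rank"])
print(reindexed)

# set_index: the Rank column itself becomes the index, rows stay where they are.
df_by_rank = df_demo.set_index("Rank")
print(df_by_rank)
```

With the real file, `pd.read_csv('IMDB-Movie-Data.csv', index_col='Rank')` should give the same Rank-indexed frame in one step.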

### 2. View Data
#### As we go through this section, write your own definitions of what each function does as a comment in the code box.

In [5]:
# Use the head() function to view the first 5 rows of the data set
df_first_5 = df_index.head(5)
print(df_first_5)

      Rank           Title                     Genre  \
Rank                                                   
1      2.0      Prometheus  Adventure,Mystery,Sci-Fi   
2      3.0           Split           Horror,Thriller   
3      4.0            Sing   Animation,Comedy,Family   
4      5.0   Suicide Squad  Action,Adventure,Fantasy   
5      6.0  The Great Wall  Action,Adventure,Fantasy   

                                            Description              Director  \
Rank                                                                            
1     Following clues to the origin of mankind, a te...          Ridley Scott   
2     Three girls are kidnapped by a man with a diag...    M. Night Shyamalan   
3     In a city of humanoid animals, a hustling thea...  Christophe Lourdelet   
4     A secret government agency recruits some of th...            David Ayer   
5     European mercenaries searching for black powde...           Yimou Zhang   

                                       

In [6]:
# Use the tail() function to view the last 5 rows of the data set
df_last_5 = df_index.tail(5)
print(df_last_5)

        Rank                   Title                  Genre  \
Rank                                                          
996    997.0         Hostel: Part II                 Horror   
997    998.0  Step Up 2: The Streets    Drama,Music,Romance   
998    999.0            Search Party       Adventure,Comedy   
999   1000.0              Nine Lives  Comedy,Family,Fantasy   
1000     NaN                     NaN                    NaN   

                                            Description          Director  \
Rank                                                                        
996   Three American college students studying abroa...          Eli Roth   
997   Romantic sparks occur between two dance studen...        Jon M. Chu   
998   A pair of friends embark on a mission to reuni...    Scot Armstrong   
999   A stuffy businessman finds himself trapped ins...  Barry Sonnenfeld   
1000                                                NaN               NaN   

                  

### 3. Understand basic information about the data
#### As we go through this section, write your own definitions of what each function does as a comment in the code box.

In [7]:
# The info() function: 

# Prints a summary of the DataFrame: the index range, each column's data type
# and non-null count, and the memory usage.
# info() prints directly and returns None, so it does not need to be wrapped in print().

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB
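The non-null counts above imply missing values: 1000 - 872 = 128 rows lack a revenue figure and 1000 - 936 = 64 lack a Metascore. A small sketch of how those gaps could be counted directly, using made-up values (only the column names come from the dataset, and `df_demo` is an illustrative name):

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names match the IMDB dataset
df_demo = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Revenue (Millions)": [103.14, np.nan, 54.65, np.nan],
    "Metascore": [76.0, 65.0, np.nan, 59.0],
})

# Number of missing values per column (the complement of info()'s non-null count)
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# Fraction of rows missing in each column
missing_share = df_demo.isna().mean()
print(missing_share)
```

Knowing which columns have gaps matters later, because filters and sorts on `Revenue (Millions)` will encounter `NaN`.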


In [8]:
# shape:
# An attribute (not a method) holding a tuple with the dimensions of the DataFrame
df.shape

# What information does shape provide? The number of rows and columns.
# What is the order? (rows, columns)

(1000, 12)

In [9]:
# columns:
# An attribute holding the column labels of a DataFrame as a pandas Index object (not a plain list).
df.columns
# What information does columns provide? The column names, plus the dtype of the labels themselves.

Index(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

In [10]:
# The describe() function: 
# Returns descriptive statistics for the numeric columns of a DataFrame:
# count, mean, standard deviation, min, max, and quartile values.

df.describe()



| | Rank | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 872.0 | 936.0 |
| mean | 500.5 | 2012.783 | 113.172 | 6.7232 | 169808.3 | 82.956376 | 58.985043 |
| std | 288.819436 | 3.205962 | 18.810908 | 0.945429 | 188762.6 | 103.25354 | 17.194757 |
| min | 1.0 | 2006.0 | 66.0 | 1.9 | 61.0 | 0.0 | 11.0 |
| 25% | 250.75 | 2010.0 | 100.0 | 6.2 | 36309.0 | 13.27 | 47.0 |
| 50% | 500.5 | 2014.0 | 111.0 | 6.8 | 110799.0 | 47.985 | 59.5 |
| 75% | 750.25 | 2016.0 | 123.0 | 7.4 | 239909.8 | 113.715 | 72.0 |
| max | 1000.0 | 2016.0 | 191.0 | 9.0 | 1791916.0 | 936.63 | 100.0 |


### 4. Data Selection -- Indexing and Slicing

In [14]:
# Extract data as a Series
# Define and print the variable 'genre'
# (selecting a single column already returns a Series, so no pd.Series() wrapper is needed)
genre = df.loc[:,"Genre"]
print(genre)


0       Action,Adventure,Sci-Fi
1      Adventure,Mystery,Sci-Fi
2               Horror,Thriller
3       Animation,Comedy,Family
4      Action,Adventure,Fantasy
                 ...           
995         Crime,Drama,Mystery
996                      Horror
997         Drama,Music,Romance
998            Adventure,Comedy
999       Comedy,Family,Fantasy
Name: Genre, Length: 1000, dtype: object


In [25]:
# Extract the data as a DataFrame and print it
# iloc selects columns by integer position; loc selects them by label
df_subset = df.iloc[:,[1,2,3]]
df_subset2 = df.loc[:,['Rating', 'Votes', 'Revenue (Millions)']]
print(df_subset)
print(df_subset2)


                       Title                     Genre  \
0    Guardians of the Galaxy   Action,Adventure,Sci-Fi   
1                 Prometheus  Adventure,Mystery,Sci-Fi   
2                      Split           Horror,Thriller   
3                       Sing   Animation,Comedy,Family   
4              Suicide Squad  Action,Adventure,Fantasy   
..                       ...                       ...   
995     Secret in Their Eyes       Crime,Drama,Mystery   
996          Hostel: Part II                    Horror   
997   Step Up 2: The Streets       Drama,Music,Romance   
998             Search Party          Adventure,Comedy   
999               Nine Lives     Comedy,Family,Fantasy   

                                           Description  
0    A group of intergalactic criminals are forced ...  
1    Following clues to the origin of mankind, a te...  
2    Three girls are kidnapped by a man with a diag...  
3    In a city of humanoid animals, a hustling thea...  
4    A secret gove

In [30]:
# Extract data using loc
# loc selects by label: here, all rows (:) of the 'Title' column
df.loc[:,'Title']


0      Guardians of the Galaxy
1                   Prometheus
2                        Split
3                         Sing
4                Suicide Squad
                ...           
995       Secret in Their Eyes
996            Hostel: Part II
997     Step Up 2: The Streets
998               Search Party
999                 Nine Lives
Name: Title, Length: 1000, dtype: object

In [27]:
# iloc selects by integer position: here, the fifth row (position 4) of df_index
df_index.iloc[4]


Rank                                                                6.0
Title                                                    The Great Wall
Genre                                          Action,Adventure,Fantasy
Description           European mercenaries searching for black powde...
Director                                                    Yimou Zhang
Actors                    Matt Damon, Tian Jing, Willem Dafoe, Andy Lau
Year                                                             2016.0
Runtime (Minutes)                                                 103.0
Rating                                                              6.1
Votes                                                           56036.0
Revenue (Millions)                                                45.13
Metascore                                                          42.0
Name: 5, dtype: object
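Once the index holds something other than 0, 1, 2, ..., `loc` and `iloc` can return different rows for the same number, as the `Name: 5` in the `iloc[4]` output above suggests. A minimal sketch on a five-row stand-in indexed by Rank (`df_demo` is an illustrative name, and the rows are the first five titles shown earlier):

```python
import pandas as pd

# Five-row stand-in, indexed by Rank so that labels run 1..5
df_demo = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Title": ["Guardians of the Galaxy", "Prometheus", "Split", "Sing", "Suicide Squad"],
}).set_index("Rank")

by_label = df_demo.loc[4]      # the row whose index label is 4
by_position = df_demo.iloc[4]  # the fifth row, whatever its label is

print(by_label["Title"])     # Sing
print(by_position["Title"])  # Suicide Squad

# Slices differ as well: loc includes the end label, iloc excludes the end position
print(len(df_demo.loc[1:3]))   # 3
print(len(df_demo.iloc[1:3]))  # 2
```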

### 5. Data Selection - Based on Conditional Filtering

In [52]:
# Use the code block to select movies released between 2010 and 2016 (inclusive) with a rating below 6.0
# that nevertheless generated significant (top) revenue.
result_set = df[(df['Year']>=2010) & (df['Year']<=2016) & (df['Rating']<6.0)]
revenue = result_set.loc[:,'Revenue (Millions)']
# To rank by revenue, sort the Series (a Series takes no `by` argument): revenue.sort_values(ascending=False)
print(revenue)

24     103.14
27       0.01
29      54.65
34      26.84
42        NaN
        ...  
985     21.56
986     42.58
993     60.13
998       NaN
999     19.64
Name: Revenue (Millions), Length: 160, dtype: float64
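The output above is still in the original row order and includes `NaN` revenues, so it does not yet show which low-rated movies earned the most. One possible next step is sketched below on made-up rows: the titles and numbers are placeholders, and `df_demo`, `low_rated`, `ranked`, and `top_2` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names match the IMDB dataset
df_demo = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "Year": [2009, 2012, 2014, 2016, 2015],
    "Rating": [5.5, 5.2, 7.8, 4.9, 5.9],
    "Revenue (Millions)": [300.0, 103.14, 250.0, np.nan, 54.65],
})

# Same conditions as the cell above; between() is inclusive on both ends by default
mask = df_demo["Year"].between(2010, 2016) & (df_demo["Rating"] < 6.0)
low_rated = df_demo[mask]

# Sort the filtered frame by revenue, highest first (NaN values go last by default)
ranked = low_rated.sort_values(by="Revenue (Millions)", ascending=False)
print(ranked[["Title", "Revenue (Millions)"]])

# Or keep only the top earners directly
top_2 = low_rated.nlargest(2, "Revenue (Millions)")
print(top_2[["Title", "Revenue (Millions)"]])
```

Sorting the DataFrame (rather than the extracted Series) keeps the titles alongside the revenue figures, which makes the result easier to read.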


In [117]:
# Note: The article expected more than one Twilight movie at the time of publishing.
# To find the second Twilight movie, try manipulating the quantile value.
# How specific can you get?
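The code for this cell is not shown, so the following is only a guess at the kind of approach the note has in mind: use `Series.quantile()` to turn "significant revenue" into a numeric cutoff, then adjust the quantile to widen or narrow the result. The rows are made up, and `top_revenue` and `df_demo` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Made-up rows; only the column names match the IMDB dataset
df_demo = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "Revenue (Millions)": [10.0, 20.0, 30.0, 40.0, 50.0],
})


def top_revenue(frame, q):
    """Return the rows whose revenue is strictly above the q-th quantile."""
    threshold = frame["Revenue (Millions)"].quantile(q)
    return frame[frame["Revenue (Millions)"] > threshold]


# A lower quantile lets more movies through; a higher one is more selective
for q in (0.5, 0.75):
    print(q, list(top_revenue(df_demo, q)["Title"]))
```

In the studio itself, the same cutoff would presumably be combined with the year and rating conditions from the previous cell, and lowering the quantile step by step is one way to see when additional titles appear.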