In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
movie_ratings = pd.read_csv('./datasets/movie_ratings.csv')
movie_ratings.head()

Unnamed: 0,Title,US Gross,Worldwide Gross,Production Budget,Release Date,MPAA Rating,Source,Major Genre,Creative Type,IMDB Rating,IMDB Votes
0,Opal Dreams,14443,14443,9000000,Nov 22 2006,PG/PG-13,Adapted screenplay,Drama,Fiction,6.5,468
1,Major Dundee,14873,14873,3800000,Apr 07 1965,PG/PG-13,Adapted screenplay,Western/Musical,Fiction,6.7,2588
2,The Informers,315000,315000,18000000,Apr 24 2009,R,Adapted screenplay,Horror/Thriller,Fiction,5.2,7595
3,Buffalo Soldiers,353743,353743,15000000,Jul 25 2003,R,Adapted screenplay,Comedy,Fiction,6.9,13510
4,The Last Sin Eater,388390,388390,2200000,Feb 09 2007,PG/PG-13,Adapted screenplay,Drama,Fiction,5.7,1012


## DataType and DataType Conversion

The `dtypes` property is used to find the dtypes in the DataFrame.

In [None]:
movie_ratings.dtypes

Title                 object
US Gross               int64
Worldwide Gross        int64
Production Budget      int64
Release Date          object
MPAA Rating           object
Source                object
Major Genre           object
Creative Type         object
IMDB Rating          float64
IMDB Votes             int64
dtype: object

The `dtypes` attribute returns a Series with the data type of each column.

<img src="https://www.w3resource.com/w3r_images/pandas-dataframe-dtypes-1.png" width="500">

While columns containing strings commonly have the `object` data type, an `object` column can also hold other Python objects such as lists or dictionaries, or even mixed types within the same column. The `object` data type is a catch-all for columns that contain mixed types or types that aren't easily categorized.
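As a minimal sketch of this catch-all behaviour, the toy Series below (hypothetical values, not taken from `movie_ratings`) mixes a string, a list, a dictionary, and a float in one column:

```python
import pandas as pd

# Hypothetical toy column mixing a string, a list, a dict, and a float
mixed = pd.Series(['Drama', ['PG', 'R'], {'votes': 468}, 6.5])

# pandas stores such a column with the generic object dtype
print(mixed.dtype)

# Each element keeps its own Python type inside the column
element_types = [type(value) for value in mixed]
print(element_types)
```

Because every element is stored as a generic Python object, vectorized numeric or string operations on such a column are not guaranteed to work for all rows.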

### Data Type Filtering
We can filter the columns based on their data types.

In [None]:

# select just object columns
movie_ratings.select_dtypes(include='object').head()


Unnamed: 0,Title,MPAA Rating,Source,Major Genre,Creative Type
0,Opal Dreams,PG/PG-13,Adapted screenplay,Drama,Fiction
1,Major Dundee,PG/PG-13,Adapted screenplay,Western/Musical,Fiction
2,The Informers,R,Adapted screenplay,Horror/Thriller,Fiction
3,Buffalo Soldiers,R,Adapted screenplay,Comedy,Fiction
4,The Last Sin Eater,PG/PG-13,Adapted screenplay,Drama,Fiction


In [None]:
# select the numeric columns
movie_ratings.select_dtypes(include='number').head()

Unnamed: 0,US Gross,Worldwide Gross,Production Budget,IMDB Rating,IMDB Votes,Release Year,ratio_wgross_by_budget
0,14443,14443,9000000,6.5,468,2006,0.001605
1,14873,14873,3800000,6.7,2588,1965,0.003914
2,315000,315000,18000000,5.2,7595,2009,0.0175
3,353743,353743,15000000,6.9,13510,2003,0.023583
4,388390,388390,2200000,5.7,1012,2007,0.176541


### Available Data Types and Associated Built-in Functions

In a DataFrame, columns can have different data types. Here are the common data types you'll encounter and some built-in functions associated with each type:

1. **Numerical Data (int, float)**
   - Built-in functions: `mean()`, `sum()`, `min()`, `max()`, `std()`, `median()`, `quantile()`, etc.

2. **Object Data (str or mixed types)**
   - Built-in functions: `str.contains()`, `str.startswith()`, `str.endswith()`, `str.lower()`, `str.upper()`, `str.replace()`, etc.

3. **Datetime Data (datetime64)**
   - Built-in attributes and methods: `dt.year`, `dt.month`, `dt.day`, `dt.weekday`, `dt.hour`, `dt.strftime()`, etc.

These functions help in exploring and transforming the data effectively depending on the type of data in each column.
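The grouping above can be sketched on a small, hypothetical DataFrame that mirrors three columns of `movie_ratings` (the variable names `toy`, `mean_rating`, `starts_with_the`, and `release_years` are illustrative only):

```python
import pandas as pd

# Hypothetical toy data mirroring a few columns of movie_ratings
toy = pd.DataFrame({
    'Title': ['Opal Dreams', 'Major Dundee', 'The Informers'],
    'IMDB Rating': [6.5, 6.7, 5.2],
    'Release Date': pd.to_datetime(['2006-11-22', '1965-04-07', '2009-04-24']),
})

# Numerical column: aggregation methods are called directly on the Series
mean_rating = toy['IMDB Rating'].mean()

# String column: string methods live under the .str accessor
starts_with_the = toy['Title'].str.startswith('The')

# Datetime column: date components live under the .dt accessor
release_years = toy['Release Date'].dt.year

print(mean_rating)
print(starts_with_the.tolist())
print(release_years.tolist())
```

Note that the `.str` and `.dt` accessors are only available when the column holds string-like or datetime-like data, respectively, which is why checking the dtype first matters.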

### Data Type Conversion

   When you work on a specific column, be mindful of its data type, because the built-in functions available for the column depend on it.

   Often, we need to convert the data types of some of the columns to make them suitable for analysis. For example, the data type of `Release Date` in the DataFrame `movie_ratings` is `object`. To perform datetime-related computations on this variable, we'll need to convert it to a datetime format. We'll use the Pandas function `to_datetime()` to convert it. Similar functions such as `to_numeric()` and `to_timedelta()`, as well as the `astype()` method, can be used for other conversions.
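For a non-datetime conversion, here is a minimal sketch assuming a hypothetical column of budgets that was read in as text (the Series `budgets` is made up for illustration):

```python
import pandas as pd

# Hypothetical budgets stored as text rather than numbers
budgets = pd.Series(['9000000', '3800000', '18000000'])

# Text -> numbers, so that sum(), mean(), etc. behave numerically
as_numbers = pd.to_numeric(budgets)

# Numbers -> text, using the general-purpose astype() method
as_text = as_numbers.astype(str)

print(as_numbers.dtype)
print(as_numbers.sum())
print(as_text.tolist())
```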

In [None]:
movie_ratings['Release Date']

0       Nov 22 2006
1       Apr 07 1965
2       Apr 24 2009
3       Jul 25 2003
4       Feb 09 2007
           ...     
2223    Jul 07 2004
2224    Jun 19 1998
2225    May 14 2010
2226    Jun 14 1991
2227    Jan 23 1998
Name: Release Date, Length: 2228, dtype: object

In [None]:
# check the datatype of the Release Date column
movie_ratings['Release Date'].dtypes

dtype('O')

We can see above that `Release Date` is currently stored with the `object` dtype (`dtype('O')`). The function `to_datetime()` converts it to a `datetime` format.

Next, we’ll update the variable `Release Date` in the DataFrame to be in the `datetime` format:

In [None]:
movie_ratings['Release Date'] = pd.to_datetime(movie_ratings['Release Date'])

In [None]:
# Let's check the datatype of the Release Date column again
movie_ratings['Release Date'].dtypes

dtype('<M8[ns]')

`dtype('<M8[ns]')` means a 64-bit datetime object with nanosecond precision stored in little-endian format. This data type is commonly used to represent timestamps in high-resolution time series data.
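Rather than reading the raw dtype string, the dtype can also be checked programmatically. The sketch below uses two hypothetical date strings in the same format as `Release Date`; the exact time unit printed may differ between pandas versions, so the check relies on `pd.api.types.is_datetime64_any_dtype` instead of a specific unit:

```python
import pandas as pd

# Hypothetical sample in the same 'Mon DD YYYY' format as Release Date
dates = pd.to_datetime(pd.Series(['Nov 22 2006', 'Apr 07 1965']))

# True for any datetime64 column, regardless of its time unit
is_datetime = pd.api.types.is_datetime64_any_dtype(dates)

print(dates.dtype, is_datetime)
```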

Next, we can use the built-in datetime attributes to extract the year from this variable and create the `Release Year` column.

In [None]:
# Extracting the year from the release date
movie_ratings['Release Year'] = movie_ratings['Release Date'].dt.year
movie_ratings.head()

Unnamed: 0,Title,US Gross,Worldwide Gross,Production Budget,Release Date,MPAA Rating,Source,Major Genre,Creative Type,IMDB Rating,IMDB Votes,Release Year,ratio_wgross_by_budget
0,Opal Dreams,14443,14443,9000000,2006-11-22,PG/PG-13,Adapted screenplay,Drama,Fiction,6.5,468,2006,0.001605
1,Major Dundee,14873,14873,3800000,1965-04-07,PG/PG-13,Adapted screenplay,Western/Musical,Fiction,6.7,2588,1965,0.003914
2,The Informers,315000,315000,18000000,2009-04-24,R,Adapted screenplay,Horror/Thriller,Fiction,5.2,7595,2009,0.0175
3,Buffalo Soldiers,353743,353743,15000000,2003-07-25,R,Adapted screenplay,Comedy,Fiction,6.9,13510,2003,0.023583
4,The Last Sin Eater,388390,388390,2200000,2007-02-09,PG/PG-13,Adapted screenplay,Drama,Fiction,5.7,1012,2007,0.176541


In Pandas, the `errors='coerce'` parameter is often used during data conversion, for example with `pd.to_numeric()` and `pd.to_datetime()`. It tells Pandas to convert the values that it can and to set the ones it cannot convert to `NaN` (or `NaT` for datetimes), instead of raising an exception. Read the textbook for an example.
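A minimal sketch of this behaviour, assuming a hypothetical vote-count column in which one entry is not a number:

```python
import pandas as pd

# Hypothetical vote counts read as text, with one unparseable entry
raw_votes = pd.Series(['468', '2588', 'unknown', '7595'])

# Without errors='coerce' this call would raise an exception on 'unknown';
# with it, the unparseable entry becomes a missing value
clean_votes = pd.to_numeric(raw_votes, errors='coerce')

print(clean_votes)
print(clean_votes.isna().sum())  # number of values that failed to convert
```

Counting the missing values after coercion is a quick way to see how many entries could not be converted before deciding how to handle them.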

## Numeric Columns 

### Summary statistics across rows/columns in Pandas

The Pandas DataFrame class has methods such as `sum()` and `mean()` to compute statistics over the rows or columns of a DataFrame.

By default, methods like `mean()` and `sum()` compute the statistic for each column (i.e., all rows are aggregated) in the DataFrame.
Let us first summarize the numeric columns with `describe()`, and then compute the mean of each numeric column:

In [None]:
movie_ratings.describe()

Unnamed: 0,US Gross,Worldwide Gross,Production Budget,IMDB Rating,IMDB Votes
count,2228.0,2228.0,2228.0,2228.0,2228.0
mean,50763700.0,101937000.0,38160550.0,6.239004,33585.154847
std,66430810.0,164858900.0,37826040.0,1.243285,47325.651561
min,0.0,884.0,218.0,1.4,18.0
25%,9646188.0,13207370.0,12000000.0,5.5,6659.25
50%,28386490.0,42668920.0,26000000.0,6.4,18169.0
75%,64531400.0,120000000.0,53000000.0,7.1,40092.75
max,760167600.0,2767891000.0,300000000.0,9.2,519541.0


In [None]:
# compute the mean of each numeric column
movie_ratings.mean(numeric_only=True)

US Gross                  5.076370e+07
Worldwide Gross           1.019370e+08
Production Budget         3.816055e+07
IMDB Rating               6.239004e+00
IMDB Votes                3.358515e+04
Release Year              2.002005e+03
ratio_wgross_by_budget    1.259483e+01
dtype: float64

**Using the `axis` parameter**:

The `axis` parameter controls whether to compute the statistic across rows or columns:
* The argument `axis=0` (default) aggregates over all the rows, returning one value per column.
* The argument `axis=1` aggregates over all the columns, returning one value per row.
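The two settings can be compared side by side on a small, hypothetical DataFrame (the name `scores` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical 2x2 table: two rows, two columns 'a' and 'b'
scores = pd.DataFrame({'a': [1.0, 3.0], 'b': [2.0, 6.0]})

# axis=0: collapse the rows, one mean per column
col_means = scores.mean(axis=0)

# axis=1: collapse the columns, one mean per row
row_means = scores.mean(axis=1)

print(col_means)
print(row_means)
```

The result of `axis=0` is indexed by column name, while the result of `axis=1` is indexed by the row labels.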

If the mean over a subset of columns is desired, select those columns from the DataFrame first.

For example, let us compute the mean IMDB rating, and mean IMDB votes of all the movies:


In [None]:
movie_ratings[['IMDB Rating','IMDB Votes']].mean(axis = 0)

IMDB Rating        6.239004
IMDB Votes     33585.154847
dtype: float64

**Pandas `sum` function**

In [None]:
data = [[10, 18, 11], [13, 15, 8], [9, 20, 3]]
df = pd.DataFrame(data)
df

Unnamed: 0,0,1,2
0,10,18,11
1,13,15,8
2,9,20,3


In [None]:
# By default, the sum() method adds values across rows and returns the sum of each column
df.sum()

0    32
1    53
2    22
dtype: int64

In [None]:
# By specifying the column axis (axis='columns'), the sum() method adds values across columns and returns the sum of each row.
df.sum(axis = 'columns')

0    39
1    36
2    32
dtype: int64

In [None]:
# in pandas, axis=1 is equivalent to axis='columns', while axis=0 is equivalent to axis='index' (rows)
df.sum(axis = 1)

0    39
1    36
2    32
dtype: int64