## Project 2 - Films

The data was found on a [website](https://perso.telecom-paristech.fr/eagan/class/igr204/datasets) that provides datasets suitable for programming projects.

It was discovered during a Google search for "Free Datasets". 

The link above was clicked, followed by "Download csv file." under "Films" to access the dataset.

The size of the CSV file is 160 KB (0.16 MB) and it contains information about films and their characteristics.

This analysis aims to answer the following questions:
* Is there an association between genre and award status?
* What percentage of films received both a [high popularity rating](#definition) and an award?
* How well can a film's [age](#definition) predict its popularity rating?

In [None]:
#import libraries
import datetime
import matplotlib.pyplot as plot
import pandas as pd
from scipy.stats import chi2_contingency
import seaborn as sns 
from sklearn.linear_model import LinearRegression

In [None]:
#load data 
file = "https://perso.telecom-paristech.fr/eagan/class/igr204/data/film.csv"
film = pd.read_csv(file, sep= None, engine = "python", encoding = "latin-1")

#view first 5 rows of dataset
film.head(5)

### Data Cleaning and Preparation 
The following steps are conducted to ensure the data is ready for manipulation:
1) Drop the first row of the dataset that denotes the datatypes of the columns, along with the columns not needed for the analysis ("*Image", "Actor", "Actress", "Director").  
2) Rename the "Subject" column to "Genre" column.  
3) Create a "Current Year" column and subtract the "Year" column from it to make an "Age" column.  
4) Drop the "Current Year" column.  
5) Convert the datatypes of the columns that are classified as objects.  
6) Drop all rows with at least one null value since it totals to less than 5% of the observations in the dataset (rule of thumb according to [Statistics Solutions](#reference) - "In statistical language, if the number of the cases [with missing values] is less than 5% of the sample, then the researcher can drop them"). 
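
The 5% rule of thumb in step 6 can be sketched as a guard that drops rows only when the share of incomplete rows is under the threshold. This is a minimal illustration on toy data, not the notebook's own code: the helper `drop_sparse_missing` and its `threshold` parameter are hypothetical names introduced here.

```python
import pandas as pd

def drop_sparse_missing(df, threshold=0.05):
    """Drop rows with any null value only if they make up less than `threshold` of all rows.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Share of rows that contain at least one missing value
    missing_share = df.isna().any(axis=1).mean()
    if missing_share < threshold:
        return df.dropna()
    # Otherwise leave the data untouched so the decision can be revisited
    return df

# Toy data: 1 of 25 rows (4%) has a missing value, so it is dropped
toy = pd.DataFrame({"Length": [100.0] * 24 + [None], "Popularity": list(range(25))})
cleaned = drop_sparse_missing(toy)

# Toy data: 2 of 10 rows (20%) have a missing value, so nothing is dropped
toy_many = pd.DataFrame({"Length": [100.0] * 8 + [None, None]})
kept = drop_sparse_missing(toy_many)

print(len(toy), len(cleaned), len(toy_many), len(kept))
```

Returning the frame unchanged when the threshold is exceeded makes the rule explicit instead of dropping rows regardless of the check's outcome.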

In [None]:
#determine number of rows and columns
film.shape

In [None]:
#remove column types under headers for relevancy purposes
film = film.drop([0])

#remove multiple columns for relevancy purposes
film = film.drop(columns = ["*Image","Actor","Actress","Director"])

In [None]:
#rename subject column to genre
film = film.rename(columns={"Subject": "Genre"})

In [None]:
#create current year column
film["Current Year"] = datetime.datetime.now().year

#convert current year and year columns to integer datatype
film["Current Year"] = film["Current Year"].astype("int64")
film["Year"] = film["Year"].astype("int64")

#create age column 
film["Age"] = film["Current Year"] - film["Year"]

#remove current year column for relevancy purposes
film = film.drop(columns = "Current Year")

In [None]:
#display general info about dataframe
film.info()

In [None]:
#convert datatypes of remaining columns 
film["Year"] = film["Year"].astype("category")
film["Length"] = film["Length"].astype("float64")
film["Title"] = film["Title"].astype("str")
film["Genre"]= film["Genre"].astype("category")
film["Popularity"] = film["Popularity"].astype("float64")
film["Awards"] = film["Awards"].astype("category")

In [None]:
#find all rows with at least one null value
n_data = film[film.isna().any(axis=1)]

#determine if percentage of rows with null values in dataframe is less than 5% 
print(round(len(n_data)/len(film)*100,2) < 5)

#drop the rows from the dataset 
film = film.dropna()

In [None]:
#display first five rows of modified dataset
film.head(5)

### Exploratory Data Analysis

The dataset initially had 1,660 observations and 10 variables, but after cleaning and preparing it for analysis, it now has 1,585 observations and 7 variables.

The most common genre is drama (40.00% of films), the most common production year is 1991 (7.51%), and 89.97% of the films did not win any awards.

The average length of the films is 105.28 minutes, the average popularity rating is 43.21 points, and the average age of the films is 45.44 years. 

Length ranges from 5.00 to 450.00 minutes, popularity rating from 0.00 to 88.00 points, and age from 24.00 to 101.00 years.

The standard deviations of the length, popularity, and age of the films are 30.51 minutes, 26.71 points, and 17.02 years, respectively.

In [None]:
#determine number of rows and columns of modified dataset
film.shape

#### Categorical Variables 

In [None]:
#display frequency and percentage for genre in descending order
genre = pd.crosstab(index = film["Genre"],columns = "Frequency") 
genre["Percentage"] = round(genre/genre.sum()*100,2)
genre.sort_values("Frequency", ascending = False)

In [None]:
#display frequency and percentage for production year in descending order
year = pd.crosstab(index = film["Year"],columns = "Frequency") 
year["Percentage"] = round(year/year.sum()*100,2)
year.sort_values("Frequency", ascending = False)

In [None]:
#display frequency and percentage for award status in descending order
award = pd.crosstab(index = film["Awards"],columns = "Frequency") 
award["Percentage"] = round(award/award.sum()*100,2)
award.sort_values("Frequency", ascending = False)

#### Numerical Variables

In [None]:
#display summary statistics for each numerical variable 
round(film.describe(),2)

### Subsetting Dataframe

What percentage of films received both a high popularity rating and an award?

In [None]:
#filtered to retrieve rows that correspond to films with a high popularity rating and an award
hpra = film[(film["Popularity"]>= 67.00) & (film["Awards"]== "Yes")]

#display first five rows
hpra.head(5)

In [None]:
#determine percentage of films with a high rating and an award 
print("The percentage of films with a high rating and an award is " + str(round(len(hpra)/len(film)*100,1)) +"%.")

In [None]:
#initialize empty list
r_and_a = []

#filter rows into remaining categories
hpr = film[(film["Popularity"]>= 67.00) & (film["Awards"]== "No")]
ipr = film[(film["Popularity"] <= 66.00) & (film["Popularity"] >= 20.00 ) & (film["Awards"]== "No")]
lpr = film[(film["Popularity"] <= 19.00) & (film["Awards"]== "No")]
ipra = film[(film["Popularity"] <= 66.00) & (film["Popularity"] >= 20.00 ) & (film["Awards"]== "Yes")]
lpra = film[(film["Popularity"]<= 19.00) & (film["Awards"]== "Yes")]

#append number of rows in each category to list
r_and_a.append(len(hpr))
r_and_a.append(len(ipr))
r_and_a.append(len(lpr))
r_and_a.append(len(hpra))
r_and_a.append(len(ipra))
r_and_a.append(len(lpra))

### Chi-Square Test of Independence

Is there an association between genre and award status?

*Null Hypothesis*: There is no relationship between genre and award status (independent). 

*Alternative Hypothesis*: There is a relationship between genre and award status (dependent). 

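The p-value is the probability of observing an association at least as strong as the one in the data, assuming the null hypothesis is true. 

If the p-value is greater than 0.05, we fail to reject the null hypothesis: there is not enough evidence to conclude that genre and award status are associated.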

If the p-value is less than 0.05, we reject the null hypothesis and conclude genre and award status are dependent. 
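
The decision rule above can be sketched on a small contingency table. The counts below are invented purely for illustration and are not taken from the film dataset. Only `chi2_contingency` from SciPy is used, unpacked into its four return values.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts, invented for illustration only
toy_table = pd.DataFrame(
    {"No": [90, 45, 30], "Yes": [10, 5, 20]},
    index=["Drama", "Comedy", "Western"],
)

# chi2_contingency returns the test statistic, the p-value, the degrees of
# freedom, and the table of counts expected under independence
stat, p_value, dof, expected = chi2_contingency(toy_table)

alpha = 0.05  # significance level
decision = "reject" if p_value < alpha else "fail to reject"
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p_value:.4f}: {decision} the null hypothesis")
```

Inspecting `expected` alongside the observed counts shows which cells drive the statistic.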

In [None]:
#categorical crosstab
cat_corr = pd.crosstab(film["Genre"],film["Awards"])

In [None]:
#perform Chi-sq test
ChiSqResult = chi2_contingency(cat_corr)

In [None]:
#print p-value of test
print("The p-value of the ChiSq Test is", ChiSqResult[1])

In [None]:
#compare p-value to alpha value (significance level)
print(ChiSqResult[1] < 0.05)

### Linear Regression

How well can a film's age predict its popularity rating?

*Independent Variable*: Age

*Dependent Variable*: Popularity

Interpretation of Coefficient
* If the age of the film increases by 1 year, the popularity rating is predicted to increase by 0.15 points.

Interpretation of Intercept
* When the age of a movie is 0, the popularity rating is predicted to be 36.37.
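
As a rough worked example of the two interpretations, the fitted line can be written as predicted popularity = intercept + coefficient × age. The sketch below uses the rounded values quoted above (36.37 and 0.15), so its results can differ slightly from the fitted model's own predictions. The function name `predicted_popularity` is introduced here for illustration.

```python
# Rounded values quoted in the interpretation above
INTERCEPT = 36.37
COEFFICIENT = 0.15

def predicted_popularity(age):
    """Predicted popularity rating for a film of the given age in years."""
    return INTERCEPT + COEFFICIENT * age

# Each additional year of age adds COEFFICIENT points to the prediction
for age in (0, 30, 40):
    print(age, round(predicted_popularity(age), 2))
```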

In [None]:
#identify the variables for the model 
x = film[["Age"]]
y = film[["Popularity"]]

In [None]:
#define and fit the simple linear regression model (one predictor: age)
regr = LinearRegression()
regr.fit(x, y)

In [None]:
#determine the coefficient and intercept
print("Coefficient: \n", regr.coef_)
print("Intercept: \n", regr.intercept_)

In [None]:
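#predict the popularity of a 30 year old movie
#(a DataFrame with the "Age" column matches the feature name used when fitting)
apred_pop = regr.predict(pd.DataFrame({"Age": [30]}))
apred_pop 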

In [None]:
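#predict the popularity of a 40 year old movie
#(a DataFrame with the "Age" column matches the feature name used when fitting)
apred_pop2 = regr.predict(pd.DataFrame({"Age": [40]}))
apred_pop2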

In [None]:
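#confirm the coefficient: the 30 year prediction plus 10 times the coefficient
#should equal the 40 year prediction (computed from the fitted model so it
#stays correct when the current year, and therefore age, changes)
conf = apred_pop + 10 * regr.coef_
conf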

In [None]:
#predict the popularity of a 0 year old movie and confirm the accuracy of the intercept
apred_pop3 = regr.predict(pd.DataFrame({"Age": [0]}))
apred_pop3

In [None]:
#determine correlation between age and popularity
film["Age"].corr(film["Popularity"])

### Data Visualizations

In [None]:
#display proportions of each rating and award category with list created under Subsetting Dataframe 
Data = r_and_a
my_labels = "High Ratings, No Award","Intermediate Ratings, No Award", "Low Ratings, No Award", "High Ratings with Award", "Intermediate Ratings with Award", "Low Ratings with Award"
my_colors = ["indianred","lightcoral","rosybrown","plum","pink","palevioletred"]
my_explode = (0,0,0,0.12,0,0)
plot.pie(Data, labels = my_labels, autopct = "%1.1f%%", startangle = 15, shadow = True, colors = my_colors, explode = my_explode)
plot.title("Figure 1. Rating Range and Award Status Proportions", fontweight="bold")
plot.axis("equal")
plot.show()

In [None]:
#display genre with award status as another dimension 
sns.countplot(x = "Genre", hue = "Awards", data = film,  palette = "flare")
plot.xticks(rotation = 90)
plot.title("Figure 2. Film Genre by Award Status", fontweight="bold")
plot.ylabel("Count")
plot.legend(loc = "upper right", title = "Award Received")
plot.show()

In [None]:
#display regression plot for age and popularity
sns.regplot(x = film["Age"], y = film["Popularity"],fit_reg = True, color = "mediumorchid")
plot.title("Figure 3. Popularity Rating by Age of Film", fontweight="bold")
plot.show()

In [None]:
#display popularity rating by award status  
sns.stripplot(x = "Awards", y = "Popularity", data = film, palette = "flare")
plot.xticks(rotation = 90)
plot.title("Figure 4. Popularity Rating by Award Status", fontweight="bold")
plot.show()

### Analytical Findings 

* There is evidence that genre and award status are related. The drama category, followed by the comedy category, encompasses the films with the most awards (see Figure 2). 


* Films with both high ratings and an award constitute 3% of the observations in the dataset (see Figure 1).    


* There is a positive correlation between a film's age and popularity rating, meaning as the former increases, the latter does as well (see Figure 3).  

### Implications
* The majority of the films in the dataset fall into the drama and comedy categories and a very small proportion of the films received awards, so it is to be expected that drama and comedy films receive the most awards.


* There is a lower proportion of films with high ratings and an award in comparison to both films with intermediate ratings and an award (3.8%) and films with low ratings and an award (3.2%). This suggests that popularity rating has little to no impact on whether or not a film receives an award (see Figure 4).


* While there is a positive correlation between a film's age and popularity, it is an extremely weak one with a ~0.1 correlation coefficient. 
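
To put the weak correlation in perspective: in a simple linear regression with a single predictor, the share of variance explained (R²) equals the square of the Pearson correlation coefficient. The sketch below applies that to the approximate value of 0.1 quoted above. It is an illustration only, not a value computed from the dataset.

```python
# Approximate correlation coefficient quoted above
r = 0.1

# For one predictor, R^2 is the square of the Pearson correlation
r_squared = r ** 2

print(f"Share of variance in popularity explained by age: {r_squared:.1%}")
```

In other words, age would account for roughly 1% of the variation in popularity rating.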

### Reference

"Missing Values in Data", Accessed from [Statistics Solutions](https://www.statisticssolutions.com/missing-values-in-data/) on April 1, 2021.

<a id='reference'></a>

### Appendix  

*Age*: The (film production) year subtracted from the current year.   

*High Popularity Rating*: Falls on a scale from 67.00 (75th percentile) to 88.00 (maximum rating).  

*Intermediate Popularity Rating*: Falls on a scale from 20.00 to 66.00 (between 25th and 75th percentile).  

*Low Popularity Rating*: Falls on a scale from 0.00 (minimum rating) to 19.00 (25th percentile).  
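
The three rating bands defined above can also be assigned in one step with `pd.cut`. This is a minimal sketch on made-up ratings, assuming whole-number ratings between 0 and 88. With fractional ratings, the right-inclusive bins would place a value such as 19.5 in the intermediate band, which the definitions above leave unspecified.

```python
import pandas as pd

# Hypothetical ratings chosen to hit each band and its boundaries
ratings = pd.Series([0, 19, 20, 66, 67, 88], name="Popularity")

# Right-inclusive bins: (-1, 19], (19, 66], (66, 88]
bands = pd.cut(
    ratings,
    bins=[-1, 19, 66, 88],
    labels=["Low", "Intermediate", "High"],
)

print(pd.concat([ratings, bands.rename("Band")], axis=1))
```

A band column like this could then be cross-tabulated against award status instead of filtering each combination separately.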

<a id='definition'></a>