
Download the dataset from:  https://github.com/bellawillrise/Introduction-to-Numerical-Computing-in-Python/

Submit a PDF file that is a rendered, saved version of the Jupyter notebook. Make sure to execute all the code cells so that the output is visible in the PDF.

Also include the link to the public GitHub repository where the Jupyter notebook for the assignment is uploaded.

Link to the GitHub repository: <</insert link>>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# %matplotlib inline

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
data = pd.read_csv('/content/drive/MyDrive/movie_metadata_cleaned.csv')

In [None]:
data.head(2)

In [None]:
result = data[data['director_name'] == '0']

## Get the top 10 directors with most movies directed and use a boxplot for their gross earnings

In [None]:

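# Treat placeholder director names (0 or '0') as missing.
data['director_name'] = data['director_name'].replace([0, '0'], pd.NA)
data['gross'] = pd.to_numeric(data['gross'], errors='coerce')
data['budget'] = pd.to_numeric(data['budget'], errors='coerce')

clean_data = data.dropna(subset=['director_name', 'gross', 'budget'])
# Keep only rows where both gross and budget are positive; copy so that
# columns added later do not trigger SettingWithCopyWarning.
clean_data = clean_data[(clean_data['gross'] > 0) & (clean_data['budget'] > 0)].copy()

# Top 10 directors by number of movies directed.
top_10 = clean_data['director_name'].value_counts().head(10)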
print(top_10)

total_gross_by_director = clean_data.groupby('director_name')['gross'].sum().sort_values(ascending=False)

top_10_by_gross = total_gross_by_director[total_gross_by_director.index.isin(top_10.index)]
filtered_data = clean_data[clean_data['director_name'].isin(top_10_by_gross.index)]

plt.figure(figsize=(12, 8))
sns.boxplot(x='director_name', y='gross', data=filtered_data, palette='viridis')
print(top_10_by_gross)

plt.xticks(rotation=45)
plt.title('Gross Earnings of Top 10 Directors with Most Movies Directed')
plt.xlabel('Director Name')
plt.ylabel('Gross Earnings')

plt.show()



## Plot the following variables in one graph:

- num_critic_for_reviews
- IMDB score
- gross

In [None]:
print(data.columns)

In [None]:
pair_data = clean_data[['gross', 'num_critic_for_reviews', 'imdb_score']]
sns.pairplot(pair_data)
plt.show()

## Compute Sales (Gross - Budget), add it as another column

In [None]:
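# Sales = gross - budget, computed on the cleaned rows.
clean_data['sales'] = clean_data['gross'] - clean_data['budget']
# Avoid the name `display`, which shadows IPython's built-in display().
preview = clean_data[['movie_title','director_name','gross','budget','sales']].head(10)
print(preview)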

## Which directors garnered the most total sales?

In [None]:
total_sales_by_director = clean_data.groupby('director_name')['sales'].sum().sort_values(ascending=False)[:10]
print(total_sales_by_director)

## Plot sales and average likes as a scatterplot. Fit it with a line.

In [None]:
plt.figure(figsize=(10, 6))
sns.regplot(x='sales', y='movie_facebook_likes', data=clean_data, ci=None, line_kws={"color": "red", "alpha": 0.7})

# The y-axis shows movie_facebook_likes, so label it as such.
plt.title('Sales vs Movie Facebook Likes with Fitted Line')
plt.xlabel('Sales')
plt.ylabel('Movie Facebook Likes')

plt.show()
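
`sns.regplot` draws the fitted line but does not return its coefficients. If the slope and intercept are needed, one option is `np.polyfit` with degree 1. The sketch below uses invented points rather than the movie data, so the numbers are illustrative only.

```python
import numpy as np

# Invented points lying exactly on y = 2x + 1, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# A degree-1 fit returns coefficients highest power first: slope, then intercept.
slope, intercept = np.polyfit(x, y, 1)
print(round(slope, 6), round(intercept, 6))
```

On the real data, the same call could be applied to the `sales` and likes columns after dropping missing values.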

## Which of these genres are the most profitable? Plot their sales using different histograms, superimposed in the same axis.

- Romance
- Comedy
- Action
- Fantasy

In [None]:
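# Match the genre anywhere in the string, so entries that list several
# genres (assumed to look like 'Action|Fantasy') are included as well.
romance_sales = clean_data[clean_data['genres'].str.contains('Romance', na=False)]['sales']
comedy_sales = clean_data[clean_data['genres'].str.contains('Comedy', na=False)]['sales']
action_sales = clean_data[clean_data['genres'].str.contains('Action', na=False)]['sales']
fantasy_sales = clean_data[clean_data['genres'].str.contains('Fantasy', na=False)]['sales']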

plt.figure(figsize=(10, 6))

plt.hist(romance_sales, bins=10, alpha=0.5, label='Romance', edgecolor='black')
plt.hist(comedy_sales, bins=10, alpha=0.5, label='Comedy', edgecolor='black')
plt.hist(action_sales, bins=10, alpha=0.5, label='Action', edgecolor='black')
plt.hist(fantasy_sales, bins=10, alpha=0.5, label='Fantasy', edgecolor='black')

plt.title('Sales Distribution by Genre')
plt.xlabel('Sales')
plt.ylabel('Frequency')
plt.legend(loc='upper right')

plt.show()
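
The histograms show the shape of each distribution, but a summary statistic makes the "most profitable" question easier to answer. The sketch below computes mean sales per genre on an invented frame; the `toy` data and its values are hypothetical, and it assumes multi-genre entries are pipe-separated. It also builds shared bin edges, which keep superimposed histograms directly comparable.

```python
import numpy as np
import pandas as pd

# Invented sales figures for illustration; not taken from the dataset.
toy = pd.DataFrame({
    'genres': ['Action|Fantasy', 'Comedy', 'Comedy|Romance', 'Action'],
    'sales': [120.0, 30.0, 50.0, 80.0],
})

genres = ['Romance', 'Comedy', 'Action', 'Fantasy']

# Mean sales per genre; a multi-genre row counts toward each genre it lists.
mean_sales = {
    g: toy.loc[toy['genres'].str.contains(g, regex=False), 'sales'].mean()
    for g in genres
}
print(mean_sales)

# Shared bin edges (10 bins) that can be passed as bins=... to every plt.hist call.
bins = np.linspace(toy['sales'].min(), toy['sales'].max(), 11)
```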

## For each movie, compute the average likes of the three actors and store it as a new variable

Read up on the mean function.

Store it as a new column, average_actor_likes.
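
As a small illustration of the row-wise mean, the sketch below uses a made-up three-row frame. Only the column names come from the dataset; the values are invented. Passing `axis=1` averages across the columns of each row.

```python
import pandas as pd

# Toy data: values are invented for illustration only.
toy = pd.DataFrame({
    'actor_1_facebook_likes': [1000, 300, 0],
    'actor_2_facebook_likes': [500, 300, 30],
    'actor_3_facebook_likes': [0, 300, 60],
})

# axis=1 computes the mean across the three columns for each row.
toy['average_actor_likes'] = toy.mean(axis=1)
print(toy['average_actor_likes'].tolist())
```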

In [None]:
print(data.columns)

In [None]:

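# Row-wise mean of the three actors' likes; astype(int) drops the decimal part.
actor_like_cols = ['actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes']
clean_data['average_actor_likes'] = clean_data[actor_like_cols].mean(axis=1).astype(int)
# Avoid the name `display`, which shadows IPython's built-in display().
preview = clean_data[['movie_title','average_actor_likes']].head(10)

print(preview)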


## Copying the whole dataframe

In [None]:
df = data.copy()
df.head()

## Min-Max Normalization

Normalization is a technique often applied as part of data preparation for machine learning. Its goal is to bring the values of numeric columns onto a common scale without distorting the relative differences between values. Not every dataset requires normalization; it is needed only when features have different ranges.

The min-max approach (often simply called normalization) rescales a feature to the fixed range [0, 1] by subtracting the feature's minimum value and then dividing by its range. We can apply min-max scaling in pandas using the .min() and .max() methods.

$$
x_{scaled} = \frac{x-x_{min}}{x_{max}-x_{min}}
$$
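
As a quick worked example of the formula, the sketch below scales three invented values. The minimum maps to 0, the maximum maps to 1, and 20 lands a third of the way between them.

```python
import pandas as pd

# Toy values, invented for illustration.
x = pd.Series([10.0, 20.0, 40.0])

# Subtract the minimum, then divide by the range (max - min).
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled.tolist())
```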

### Normalize each numeric column (those that have types integer or float) of the copied dataframe (df)

In [None]:
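# 'number' selects every integer and float column regardless of bit width.
numeric_columns = df.select_dtypes(include='number').columns

print("Numeric columns:", numeric_columns)

for column in numeric_columns:
    min_val = df[column].min()
    max_val = df[column].max()
    # Skip constant columns to avoid dividing by zero.
    if max_val == min_val:
        continue
    df[column] = (df[column] - min_val) / (max_val - min_val)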

print("Normalized Data:")
print(df)