## Movie Revenue Prediction 

### Objective: Your client is a movie studio that needs to predict movie revenue in order to greenlight a project and assign it a budget.
- Most of the data consists of categorical variables.
- Although the budget is known for the movies in the dataset, it is often unknown during the greenlighting process.

### Target Variable: Movie Revenue (movies_clean['revenue'])

#### Split data into X (features) and y (target)

In [None]:
# Separate out the features (X); drop() already returns a new DataFrame
X_features = movies_clean.drop(columns='revenue')
print(X_features.shape)

# Save feature column labels in list
feature_list = list(X_features.columns)

# Separate out the target (y)
y = movies_clean['revenue'].values
print(y.shape)

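## t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Maps high-dimensional data to a lower-dimensional space, typically two or three dimensions.
- Helps reveal patterns by exposing clusters of data points that are similar across many features.
- Note: the cells below apply principal component analysis (PCA), a linear projection, rather than t-SNE itself.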
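
As a rough illustration of the technique described above, the following minimal sketch runs scikit-learn's `TSNE` on a small synthetic matrix. The array `X_demo` and its size are assumptions standing in for the real features; the movie data itself is not used here.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for a numeric feature matrix (assumption: 200 rows, 20 features)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 20))

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)

# t-SNE has no separate transform step, so fit and embed in one call
X_tsne = tsne.fit_transform(X_demo)
print(X_tsne.shape)
```

Unlike PCA, t-SNE cannot project new rows with an already fitted model, so the embedding is mainly useful for visual exploration.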

In [None]:
# Import necessary modules
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

In [None]:
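# Create a PCA instance
# n_components must not exceed min(n_samples, n_features)
pca = PCA(n_components=4799, whiten=False, random_state=42)

# Fit the PCA model to the features and project them onto the components
X_pca = pca.fit_transform(X_features)
# Map the projected data back to the original feature space
X_pca_reconst = pca.inverse_transform(X_pca)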
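
One possible way to decide how many components are worth keeping is to look at the cumulative explained variance. The sketch below uses a synthetic matrix (`X_demo`, an assumption) and a hypothetical 95% threshold, and also checks that `inverse_transform` recovers the input when every component is kept.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the feature matrix (assumption: 100 rows, 10 features)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 10))

# n_components=None keeps min(n_samples, n_features) components
pca_demo = PCA(random_state=42)
X_demo_pca = pca_demo.fit_transform(X_demo)

# Smallest number of components whose cumulative variance reaches 95%
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

# With all components kept, the reconstruction matches the input
X_back = pca_demo.inverse_transform(X_demo_pca)
print(n_keep, np.allclose(X_demo, X_back))
```

The 95% threshold is only an illustrative choice; a lower value gives a stronger reduction at the cost of more reconstruction error.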

In [None]:
X_pca.shape

In [None]:
plt.figure(figsize=(12,12))

# Note: y holds raw revenue values, so each mask y == k only selects movies
# whose revenue is exactly k; integer class labels would be needed for the
# colours to separate meaningful groups.
colors = ['red', 'blue', 'green', 'black', 'khaki', 'yellow',
          'turquoise', 'pink', 'moccasin', 'olive', 'coral']

# Plot the first two principal components, one colour per target value 0-10
for k, color in enumerate(colors):
    plt.scatter(X_pca[y == k, 0], X_pca[y == k, 1], color=color, alpha=0.5, label=str(k))
plt.title("First two principal components of the movie data")
plt.ylabel('Y coordinates (second principal component)')
plt.xlabel('X coordinates (first principal component)')
plt.legend()
plt.show()
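
The scatter plot above selects points with masks of the form `y == k`, which assumes integer class labels. Since revenue is continuous, one possible approach (an assumption, not something the notebook does) is to bin revenue into quantile classes first. The sketch below uses synthetic revenue values; `revenue_demo`, `n_bins` and `y_binned` are hypothetical names.

```python
import numpy as np

# Synthetic, right-skewed stand-in for movie revenue (assumption)
rng = np.random.default_rng(42)
revenue_demo = rng.lognormal(mean=17, sigma=1.5, size=500)

# Inner quantile edges that split the values into n_bins equally sized groups
n_bins = 5
inner_edges = np.quantile(revenue_demo, np.linspace(0, 1, n_bins + 1)[1:-1])

# Integer class label per movie: 0 (lowest revenue) to n_bins - 1 (highest)
y_binned = np.digitize(revenue_demo, inner_edges)
print(np.bincount(y_binned))
```

With labels like `y_binned`, a mask such as `y_binned == 0` would pick out the lowest-revenue group rather than movies whose revenue is exactly zero.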