# **Classification of Netflix Titles Using Machine Learning**

## **OBJECTIVES**
The goal is to classify each Netflix title as either a Movie or a TV Show.

**Business value:** content tagging, recommendation systems, and catalog management.

**IMPORT NECESSARY LIBRARIES & LOAD DATA**

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 

df = pd.read_csv(r"netflix_titles.csv")
df.head()

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x='type', data=df)
plt.title("Class Distribution")
plt.show()

**TOP 10 GENRES**

In [None]:
plt.figure(figsize=(10,5))
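# listed_in holds comma-separated genre lists, so split and explode to count individual genres
df['listed_in'].str.split(', ').explode().value_counts().head(10).plot(kind='bar')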
plt.title("Top 10 Genres")
plt.xlabel("Genre")
plt.ylabel("Count")
plt.show()


**RELEASE YEAR TREND**

In [None]:
plt.figure(figsize=(10,5))
df['release_year'].value_counts().sort_index().plot()
plt.title("Number of Titles Released Per Year")
plt.xlabel("Year")
plt.ylabel("Count")
plt.show()


**BASIC DESCRIPTION**


In [None]:
df.info()

In [None]:
df.describe(include='all')

In [None]:
df.isnull().sum()

**HANDLING MISSING VALUES**

In [None]:
df['director'] = df['director'].fillna('Unknown')
df['cast'] = df['cast'].fillna('Not Available')
df['country'] = df['country'].fillna('Unknown Country')

In [None]:
df = df.dropna(subset=['date_added'])

df['rating'] = df['rating'].fillna(df['rating'].mode()[0])
df['duration'] = df['duration'].fillna(df['duration'].mode()[0])



In [None]:
df.isnull().sum()

**ENCODE THE TARGET (MOVIE/TV SHOW)**

In [None]:
le = LabelEncoder()
df['type_encoded'] = le.fit_transform(df['type'])
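
`LabelEncoder` assigns integer codes in sorted order of the class labels, which matters when reading reports or predictions later. A minimal sketch on toy labels (the list `toy_types` is a hypothetical stand-in for `df['type']`, not real data):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df['type'] (illustrative values only)
toy_types = ["TV Show", "Movie", "Movie", "TV Show"]

toy_le = LabelEncoder()
toy_encoded = toy_le.fit_transform(toy_types)

# classes_ is sorted, so position 0 is "Movie" and position 1 is "TV Show"
label_map = {str(label): int(code) for code, label in enumerate(toy_le.classes_)}
print(label_map)

# inverse_transform recovers the original strings from the integer codes
decoded = toy_le.inverse_transform(toy_encoded)
print(list(decoded))
```

Under this ordering, a predicted `1` means "TV Show" and a predicted `0` means "Movie".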

**SELECT FEATURES**

In [None]:
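# Join the text fields with single spaces so adjacent fields do not fuse into one token
df['combined_text'] = df['title'] + " " + df['director'] + " " + df['cast'] + " " + df['listed_in'] + " " + df['description']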
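
The separator between fields matters because the later TF-IDF step tokenizes on word boundaries: fields joined with an empty string fuse the last word of one field with the first word of the next. A small sketch with made-up values (the frame `toy` and its contents are hypothetical):

```python
import pandas as pd

# Hypothetical single-row frame; the values are placeholders, not dataset rows
toy = pd.DataFrame({
    "title": ["Some Title"],
    "director": ["Jane Doe"],
    "cast": ["John Roe"],
})

# Empty-string separator: word boundaries between fields are lost
fused = toy["title"] + "" + toy["director"] + "" + toy["cast"]

# Space separator: every field stays tokenizable on its own
spaced = toy["title"].str.cat([toy["director"], toy["cast"]], sep=" ")

print(fused[0])
print(spaced[0])
```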

In [None]:
X = df['combined_text']
y = df['type_encoded']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
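
The split above is random but not stratified. When the two classes are imbalanced, passing `stratify=y` is one way to keep the same Movie/TV Show ratio in the training and test partitions. A sketch on synthetic labels (the 70/30 ratio and the names `toy_X`, `toy_y` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 70 zeros ("Movie") and 30 ones ("TV Show")
toy_y = np.array([0] * 70 + [1] * 30)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratify, both partitions keep the 70/30 class ratio
print(y_tr.mean(), y_te.mean())
```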

**CONVERT TEXT TO NUMBERS**

In [None]:
# TfidfVectorizer is already imported in the first cell
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
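
The vectorizer is fitted on the training text only and then reused on the test text, so the vocabulary and IDF weights never see test documents. Words that appear only in the test set are simply dropped. A minimal sketch on a toy corpus (the documents below are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus standing in for X_train / X_test
train_docs = ["crime drama series", "romantic comedy movie", "crime thriller movie"]
test_docs = ["space crime documentary"]

toy_tfidf = TfidfVectorizer(stop_words="english")
train_matrix = toy_tfidf.fit_transform(train_docs)  # learns vocabulary + IDF
test_matrix = toy_tfidf.transform(test_docs)        # reuses them unchanged

# Both matrices share the training vocabulary as columns
print(train_matrix.shape, test_matrix.shape)

# "space" and "documentary" were never seen in training, so only "crime" survives
print("space" in toy_tfidf.vocabulary_, test_matrix.nnz)
```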

**MODEL 1: LOGISTIC REGRESSION**

In [None]:
model_lr = LogisticRegression(max_iter=2000)

model_lr.fit(X_train_tfidf, y_train)

pred_lr = model_lr.predict(X_test_tfidf)

**MODEL 2: DECISION TREE CLASSIFIER**

In [None]:
model_dt = DecisionTreeClassifier(random_state=42)
model_dt.fit(X_train_tfidf, y_train)

pred_dt = model_dt.predict(X_test_tfidf)

**MODEL 3: RANDOM FOREST CLASSIFIER**

In [None]:
model_rf = RandomForestClassifier(n_estimators=2000, random_state=42)
model_rf.fit(X_train_tfidf,y_train)

pred_rf = model_rf.predict(X_test_tfidf)

**MODEL ACCURACY**

In [None]:
acc_lr = accuracy_score(y_test, pred_lr)
acc_dt = accuracy_score(y_test, pred_dt)
acc_rf = accuracy_score(y_test, pred_rf)

print("Logistic Regression Accuracy -> ", acc_lr)
print("Decision Tree Accuracy -> ", acc_dt)
print("Random Forest Accuracy -> ",acc_rf)
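
Accuracy alone can hide weak performance on the minority class. The already imported `classification_report`, together with `confusion_matrix`, gives per-class precision and recall. A sketch on hand-made labels (`y_true_toy` and `y_pred_toy` are hypothetical, not model output from this notebook):

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    recall_score,
)

# Hand-made labels: 0 = Movie, 1 = TV Show (illustrative only)
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

toy_acc = accuracy_score(y_true_toy, y_pred_toy)
toy_cm = confusion_matrix(y_true_toy, y_pred_toy)  # rows = true, columns = predicted
toy_recall_tv = recall_score(y_true_toy, y_pred_toy, pos_label=1)

print("accuracy:", toy_acc)
print(toy_cm)
print(classification_report(y_true_toy, y_pred_toy, target_names=["Movie", "TV Show"]))
```

In this toy case the accuracy looks reasonable while only one of three TV Shows is recovered, which is the kind of gap the per-class report makes visible. The same calls could be applied to `y_test` with `pred_lr`, `pred_dt`, or `pred_rf`.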

**COMPARE ALL MODELS AND CHOOSE THE BEST MODEL**

In [None]:
results = { 
    "Logistic Regression": acc_lr,
    "Decision Tree" : acc_dt, 
    "Random Forest" : acc_rf
}
 

best_model = max(results, key=results.get)
print("Best Model:", best_model)
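
Picking the best model from a single train/test split can be sensitive to how the split happened to fall. One possible alternative is cross-validation with the vectorizer and classifier wrapped in a pipeline, so TF-IDF is refitted inside each fold. A sketch on a synthetic corpus (all texts, labels, and the names `toy_texts`, `candidates`, `cv_means` are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for combined_text / type_encoded (not real Netflix rows)
toy_texts = [
    "feature film about a heist", "animated film for families",
    "documentary film on oceans", "romantic film set in paris",
    "horror film in a cabin", "war film based on a novel",
    "season one of a crime series", "series following a chef",
    "anime series with two seasons", "reality series about dating",
    "limited series on a scandal", "sitcom series in an office",
]
toy_labels = [0] * 6 + [1] * 6  # 0 = Movie, 1 = TV Show

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

cv_means = {}
for name, clf in candidates.items():
    # The pipeline refits TF-IDF inside every fold, so held-out documents
    # never contribute to the vocabulary or IDF weights used to score them.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, toy_texts, toy_labels, cv=3, scoring="accuracy")
    cv_means[name] = float(scores.mean())

best_by_cv = max(cv_means, key=cv_means.get)
print(cv_means)
print("Best by CV:", best_by_cv)
```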

**FINAL SUMMARY**

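- Random Forest achieved the highest accuracy on the test split.
- Key drivers: the models only see `combined_text` (title, director, cast, genre labels, description), so genre keywords from `listed_in` are the most plausible signal. `duration` and `release_year` are not part of the feature set.
- Recommendation: use Random Forest for deployment.
- Next Steps: richer NLP on description (beyond TF-IDF), director/cast embeddings.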