## Data scraping  

In [8]:
!pip install beautifulsoup4



In [12]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [13]:
name = []
ratings = []
genre = []
year = []

In [14]:
url = "https://www.imdb.com/search/title/?title_type=tv_series&num_votes=100000,&sort=release_date,asc"
content = requests.get(url).content

In [42]:
soup = BeautifulSoup(content, "html.parser")

# Reset the lists so that re-running this cell does not append duplicate rows
name, ratings, genre, year = [], [], [], []

# Each search result sits in its own 'lister-item' container
for item in soup.find_all('div', attrs={'class': 'lister-item mode-advanced'}):
    header = item.find('h3', attrs={'class': 'lister-item-header'})
    a_name = header.find('a', href=True)
    a_rating = item.find('div', attrs={'class': 'inline-block ratings-imdb-rating'})
    a_genre = item.find('span', attrs={'class': 'genre'})
    a_year = item.find('span', attrs={'class': 'lister-item-year text-muted unbold'})
    # strip() removes the newlines and padding spaces around the scraped text
    name.append(a_name.text.strip())
    ratings.append(a_rating.text.strip())
    genre.append(a_genre.text.strip())
    year.append(a_year.text.strip())


df = pd.DataFrame({'Series Name': name, 'Release Years': year, 'Ratings': ratings, 'Genre': genre})


In [43]:
df.head()

| | Series Name | Release Years | Ratings | Genre |
|---|---|---|---|---|
| 0 | Married... with Children | (1987–1997) | 8.1 | Comedy |
| 1 | Star Trek: The Next Generation | (1987–1994) | 8.7 | Action, Adventure, Drama |
| 2 | Seinfeld | (1989–1998) | 8.9 | Comedy |
| 3 | Twin Peaks | (1990–1991) | 8.8 | Crime, Drama, Mystery |
| 4 | The Simpsons | (1989– ) | 8.7 | Animation, Comedy |


## Classification  

In [99]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, StratifiedKFold, GridSearchCV
from sklearn.metrics import make_scorer, precision_score, accuracy_score, classification_report, confusion_matrix, recall_score, f1_score

In [100]:
df['Features'] = df['Series Name'] + ' ' + df['Release Years'] + ' ' + df['Ratings'].astype(str)

In [101]:
X_train, X_test, y_train, y_test = train_test_split(df['Features'], df['Genre'], test_size=0.2, random_state=42)

# Create a pipeline with TF-IDF vectorizer and Multinomial Naive Bayes
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("\nClassification Report:\n", report)

Accuracy: 0.4

Classification Report:
                                           precision    recall  f1-score   support

    Action, Adventure, Drama                   0.00      0.00      0.00         1
        Action, Crime, Drama                   0.00      0.00      0.00         1
    Adventure, Drama, Sci-Fi                   0.00      0.00      0.00         2
Animation, Action, Adventure                   0.11      1.00      0.20         1
   Animation, Action, Comedy                   0.00      0.00      0.00         1
           Animation, Comedy                   0.00      0.00      0.00         3
                      Comedy                   0.50      1.00      0.67         2
               Comedy, Drama                   0.67      1.00      0.80         4
             Comedy, Romance                   0.00      0.00      0.00         1
       Crime, Drama, History                   0.00      0.00      0.00         1
       Crime, Drama, Mystery                   1.00      1



The initial model, a Multinomial Naive Bayes classifier trained on TF-IDF vectors, reached an accuracy of 0.4. The classification report shows that it performed poorly across most genres: precision, recall, and f1-score are 0.00 for many classes, which indicates a lack of predictive power. The macro and weighted averages are also low, suggesting an overall weak model.
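To make the macro and weighted averages concrete, here is a minimal sketch on five made-up labels (not the scraped data). It assumes scikit-learn 0.22 or later, where the metric functions accept `zero_division`. A class that is never predicted has an undefined precision; scikit-learn reports it as 0 and, in `classification_report` or `precision_score`, emits an `UndefinedMetricWarning` unless `zero_division` is set.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Made-up labels for illustration only; these are not the scraped IMDb data.
y_true = ["Comedy", "Comedy", "Comedy", "Drama", "Sci-Fi"]
# A weak model that always predicts the majority class.
y_pred = ["Comedy"] * 5

accuracy = accuracy_score(y_true, y_pred)

# Per-class precision in sorted label order: Comedy, Drama, Sci-Fi.
# Drama and Sci-Fi are never predicted, so their precision is undefined;
# zero_division=0 reports them as 0 without raising a warning.
per_class_precision = precision_score(y_true, y_pred, average=None, zero_division=0)

# Macro: plain mean of the per-class F1 scores, so every genre counts equally.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
# Weighted: mean of the per-class F1 scores, weighted by each class's support.
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f} weighted_f1={weighted_f1:.2f}")
```

In this toy case the accuracy is 0.60 while the macro F1 is only 0.25, because the two genres that are never predicted each contribute an F1 of 0 to the unweighted mean.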

The initial model used default hyperparameters. Given the number of distinct genre labels and the short text features built from each show's name, release years, and rating, tuning the hyperparameters is an important step toward better performance. GridSearchCV systematically searches a grid of hyperparameter values with cross-validation and keeps the combination with the best mean validation score. In this case, the tuned hyperparameters likely led to a more effective model.
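As a rough illustration of what a systematic search involves, the sketch below enumerates a parameter grid with `itertools.product` and counts the model fits. The grid values and the five folds mirror the `param_grid` and `cv=5` used in the next cell; the variable names `candidates` and `n_cv_fits` are only for this example.

```python
from itertools import product

# Keys follow the pipeline convention: step name, two underscores, parameter name.
param_grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "multinomialnb__alpha": [0.1, 0.5, 1.0, 2.0],
}
n_folds = 5  # matches cv=5

# Every combination of one value per parameter is one candidate model.
keys = list(param_grid)
candidates = [dict(zip(keys, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per fold during cross-validation.
n_cv_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", n_cv_fits, "cross-validation fits")
print(candidates[0])
```

With three n-gram ranges and four alpha values this gives 12 candidates and 60 cross-validation fits, plus one final refit of the best candidate on the full training set when `refit=True` (the default).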

In [104]:
# Define the parameter grid
param_grid = {
    'tfidfvectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'multinomialnb__alpha': [0.1, 0.5, 1.0, 2.0]
}

grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)


y_pred = grid_search.best_estimator_.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("\nClassification Report:\n", report)

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.8

Classification Report:
                                           precision    recall  f1-score   support

    Action, Adventure, Drama                   1.00      1.00      1.00         1
        Action, Crime, Drama                   0.33      1.00      0.50         1
    Adventure, Drama, Sci-Fi                   0.00      0.00      0.00         2
Animation, Action, Adventure                   0.33      1.00      0.50         1
   Animation, Action, Comedy                   1.00      1.00      1.00         1
           Animation, Comedy                   1.00      1.00      1.00         3
                      Comedy                   1.00      1.00      1.00         2
               Comedy, Drama                   1.00      0.50      0.67         4
             Comedy, Romance                   1.00      1.00      1.00         1
       Crime, Drama, History                   1.00      1.00      1.00         1
       Crime, Drama, Mystery                   1.00      1



A GridSearchCV approach was used to explore hyperparameter combinations for the TF-IDF vectorizer (n-gram range) and the Multinomial Naive Bayes classifier (smoothing parameter alpha). The best estimator from the grid search was then evaluated on the test set and reached an accuracy of 0.8, up from 0.4, a substantial improvement in the model's ability to predict TV show genres. The classification report for the improved model shows higher precision, recall, and f1-score values across most genres. The macro and weighted averages are also improved, indicating a more robust and reliable model.
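The tuned values themselves are not printed in the notebook. A fitted `GridSearchCV` object exposes them through `best_params_` and `best_score_`; the sketch below shows this on a small made-up corpus. The texts, labels, reduced grid, and `cv=2` are assumptions chosen so that the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus; not the scraped IMDb data.
texts = [
    "funny sitcom family", "funny sitcom friends",
    "funny office sitcom", "funny roommates sitcom",
    "dark crime detective", "crime detective murder",
    "dark murder crime", "detective crime case",
]
labels = ["Comedy"] * 4 + ["Crime"] * 4

pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
param_grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "multinomialnb__alpha": [0.1, 1.0],
}

# cv=2 keeps two samples of each class in every fold of this tiny corpus.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

# Hyperparameter combination with the best mean cross-validated accuracy.
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```

On the notebook's data the equivalent would be reading `grid_search.best_params_` after `grid_search.fit(X_train, y_train)`.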

Understanding where the model makes mistakes provides insight into its decision-making. Examining misclassifications helps identify genres that are hard to separate or specific inputs that lead to errors. Consistent patterns in the misclassifications can guide further improvement: for example, if certain genres are commonly confused, that may indicate a need for more diverse training data or a more complex model.

In [66]:
# Boolean mask marking the test rows where the prediction differs from the true label
mask = (y_test != y_pred).to_numpy()
incorrect_predictions = X_test[mask]
true_labels = y_test[mask]
predicted_labels = y_pred[mask]
# Show at most five misclassified examples
for i in range(min(5, len(incorrect_predictions))):
    print(f"True Label: {true_labels.iloc[i]} | Predicted Label: {predicted_labels[i]}")
    print(f"Text: {incorrect_predictions.iloc[i]}\n")



True Label: Adventure, Drama, Sci-Fi             | Predicted Label: Animation, Action, Adventure            
Text: Firefly (2002–2003) 9.0

True Label: Comedy, Drama             | Predicted Label: Action, Crime, Drama            
Text: Scrubs (2001–2010) 8.4

True Label: Comedy, Drama             | Predicted Label: Action, Crime, Drama            
Text: Scrubs (2001–2010) 8.4

True Label: Adventure, Drama, Sci-Fi             | Predicted Label: Animation, Action, Adventure            
Text: Firefly (2002–2003) 9.0
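One way to look for the patterns mentioned earlier is to count how often each (true, predicted) pair occurs. The sketch below does this with `collections.Counter` on the four errors printed above, copied by hand with the padding removed. It also counts repeated input texts: the same two shows appear twice each, which may point to duplicate rows in the scraped data (an assumption; calling `df.drop_duplicates()` before splitting would be one way to check).

```python
from collections import Counter

# (text, true label, predicted label) copied from the printed output, padding removed.
errors = [
    ("Firefly (2002–2003) 9.0", "Adventure, Drama, Sci-Fi", "Animation, Action, Adventure"),
    ("Scrubs (2001–2010) 8.4", "Comedy, Drama", "Action, Crime, Drama"),
    ("Scrubs (2001–2010) 8.4", "Comedy, Drama", "Action, Crime, Drama"),
    ("Firefly (2002–2003) 9.0", "Adventure, Drama, Sci-Fi", "Animation, Action, Adventure"),
]

# Count each (true, predicted) pair to spot systematic confusions.
confusions = Counter((true, pred) for _, true, pred in errors)
for (true, pred), count in confusions.most_common():
    print(f"{count}x  {true}  ->  {pred}")

# Count repeated input texts; repeats hint at duplicate rows in the data set.
text_counts = Counter(text for text, _, _ in errors)
duplicates = [text for text, count in text_counts.items() if count > 1]
print("Repeated texts:", duplicates)
```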



In Step A, the Multinomial Naive Bayes model with TF-IDF vectorization showed limitations, in particular misclassifying shows from the "Adventure, Drama, Sci-Fi" and "Comedy, Drama" genres. I will therefore switch to a Random Forest model. Random Forests can capture non-linear relationships between features and labels, which gives them flexibility that a linear model like Naive Bayes lacks. This could address the misclassifications observed above.

In [96]:
# I will use a Random Forest classifier to improve the results
model2 = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100))
model2.fit(X_train, y_train)
y_pred_improved = model2.predict(X_test)

accuracy_improved = accuracy_score(y_test, y_pred_improved)
report_improved = classification_report(y_test, y_pred_improved)

print(f"Improved Model Accuracy: {accuracy_improved}")
print("\nImproved Model Classification Report:\n", report_improved)

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_improved))

Improved Model Accuracy: 0.8

Improved Model Classification Report:
                                           precision    recall  f1-score   support

    Action, Adventure, Drama                   1.00      1.00      1.00         1
        Action, Crime, Drama                   0.20      1.00      0.33         1
    Adventure, Drama, Sci-Fi                   0.00      0.00      0.00         2
Animation, Action, Adventure                   1.00      1.00      1.00         1
   Animation, Action, Comedy                   1.00      1.00      1.00         1
           Animation, Comedy                   1.00      1.00      1.00         3
                      Comedy                   1.00      1.00      1.00         2
               Comedy, Drama                   1.00      0.50      0.67         4
             Comedy, Romance                   1.00      1.00      1.00         1
       Crime, Drama, History                   1.00      1.00      1.00         1
       Crime, Drama, Mystery



Both the Random Forest model and the Grid Search-tuned Multinomial Naive Bayes model reach the same accuracy of 0.8, meaning that each correctly predicted the genre for 80% of the test data. In terms of macro and weighted average F1-scores, the Random Forest model outperforms the tuned Multinomial Naive Bayes model. It achieves higher scores on both metrics, indicating a better overall balance between precision and recall across genres.

The confusion matrices provide further insight into the performance of each model. In the Random Forest model, the diagonal elements have higher values, indicating correct predictions for various genres. The tuned Multinomial Naive Bayes model, while achieving the same accuracy, has lower values on the diagonal, suggesting less robust performance than the Random Forest.
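When comparing confusion matrices at equal accuracy, it helps to keep in mind that on a shared test set the diagonal total divided by the number of samples is the accuracy. Two models with the same accuracy therefore have the same diagonal total and differ only in how hits and errors are spread across genres, which is what the macro F1 picks up. The sketch below illustrates this with two hypothetical models on made-up labels; it does not reproduce the results reported above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up labels for two hypothetical models; not the notebook's results.
labels = ["Comedy", "Drama"]
y_true = ["Comedy", "Comedy", "Comedy", "Comedy", "Drama", "Drama"]
predictions = {
    # Never predicts Drama.
    "model A": ["Comedy"] * 6,
    # Makes one error in each class.
    "model B": ["Comedy", "Comedy", "Comedy", "Drama", "Drama", "Comedy"],
}

results = {}
for model_name, y_pred in predictions.items():
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    # Diagonal entries count correct predictions per class.
    n_correct = int(cm.trace())
    accuracy = accuracy_score(y_true, y_pred)
    macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
    results[model_name] = (n_correct, accuracy, macro_f1)
    print(f"{model_name}: correct={n_correct} accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
```

Both toy models get 4 of 6 predictions right, yet model B has the higher macro F1 (0.625 versus 0.400) because it recovers some of the minority class.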