<a href="https://colab.research.google.com/github/Maracetta/2401PTDS_Regression_Project/blob/main/UnsupervisedNotebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unsupervised Machine Learning

### Project Title: Unsupervised
#### Done By: Karen van den Heever

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Introduce the project, outline its goals, and explain its significance.
* **Details:** Include information about the problem domain, the specific questions or challenges the project aims to address, and any relevant background information that sets the stage for the work.
---

---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** List and import all the Python packages that will be used throughout the project such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, scikit-learn for modeling, etc.
---

In [1]:
# data processing
import numpy as np
import pandas as pd
import datetime
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import ward, dendrogram
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import string
import math
import re

from collections import Counter

import nltk
nltk.download(['punkt','stopwords','wordnet','omw-1.4'])
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from wordcloud import WordCloud

from scipy.sparse import hstack, csr_matrix

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Describe how the data was collected and provide an overview of its characteristics.
* **Details:** Mention sources of the data, the methods used for collection (e.g., APIs, web scraping, datasets from repositories), and a general description of the dataset including size, scope, and types of data available (e.g., numerical, categorical).
---

* **Data Collection:** Gather movie data, including genres, ratings, and user information. Sources could include publicly available datasets like MovieLens or scraping websites if necessary and allowed.
* **Data Preprocessing:** Clean the data by handling missing values, duplicates, and outliers. Normalize the ratings to bring different users' ratings into a common scale if necessary.
* **Feature Engineering:** Create user profiles based on historical ratings and genres. For example, calculate the average rating a user gives to each genre. Consider user interaction history, such as the number of movies watched in each genre.
* **Clustering (Unsupervised Learning):** Use clustering algorithms like K-Means, Hierarchical Clustering, or DBSCAN to group users with similar tastes based on their profiles. The resulting clusters can represent different 'types' of users with shared preferences.
* **Rating Prediction:** For a new movie, identify the cluster(s) most similar to the user in question and use the average ratings of that cluster to predict the user's possible rating. Alternatively, you could use collaborative filtering techniques that can incorporate the clusters or user profiles you've generated.
* **Evaluation:** While unsupervised learning doesn't inherently provide a straightforward accuracy metric like supervised learning, you can validate your approach using techniques such as cross-validation with some form of ground truth, like a hold-out set.
* **Iteration:** Experiment with different clustering algorithms and feature combinations to find what works best for your data. Gather feedback and refine your user representation and clustering methodology.
* **Tools and Libraries:** Consider using libraries like Scikit-learn for clustering and feature engineering, and tools like Pandas and NumPy for data manipulation.
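
The feature engineering, clustering, and rating prediction steps above can be sketched on a toy ratings table. This is a minimal illustration and not the notebook's final pipeline: the column names `user_id`, `genre`, and `rating`, the made-up values, the helper `predict_rating`, and the choice of `n_clusters=2` are all assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy ratings, one row per (user, genre) rating; values are made up for illustration
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "genre":   ["action", "romance"] * 4,
    "rating":  [9, 2, 8, 3, 2, 9, 3, 8],
})

# User profile: average rating each user gives to each genre
profiles = ratings.pivot_table(index="user_id", columns="genre",
                               values="rating", aggfunc="mean").fillna(0)

# Scale the profiles, then group users with similar tastes
scaled = StandardScaler().fit_transform(profiles)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
profiles["cluster"] = kmeans.fit_predict(scaled)

# Cluster-level average rating per genre, used as a simple rating estimate
cluster_means = profiles.groupby("cluster").mean()

def predict_rating(user_id, genre):
    """Estimate a user's rating for a genre from their cluster's average (hypothetical helper)."""
    cluster = profiles.loc[user_id, "cluster"]
    return cluster_means.loc[cluster, genre]
```

On this toy data, users 1 and 2 land in one cluster and users 3 and 4 in the other, so `predict_rating(1, "action")` is the mean of 9 and 8.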

---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

In [2]:
# Loading dataset
df_train = pd.read_csv("train.csv", delimiter=',', index_col=False)
df_train['split'] = "train"
df_train.head()


In [None]:
# Keep only the first 4,000,000 training rows to limit memory usage
df_train = df_train.head(4000000)
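
Truncating with `head` still reads the whole file into memory first. A lighter option, sketched here under assumptions, is to let `pd.read_csv` stop early with `nrows` and use compact `dtype`s. The in-memory CSV stands in for `train.csv` so the example runs on its own, and the chosen dtypes are assumptions that should be checked against the real value ranges.

```python
import io
import pandas as pd

# Stand-in for train.csv so the sketch runs on its own; the real file is assumed
# to have the same three columns
csv_text = (
    "user_id,anime_id,rating\n"
    "1,11617,10\n1,11757,10\n1,15451,10\n2,11771,10\n3,20,8\n"
)

# Assumed compact dtypes: adjust if IDs exceed the int32 range
dtypes = {"user_id": "int32", "anime_id": "int32", "rating": "float32"}

# nrows stops parsing after the requested number of rows instead of loading everything
df_sample = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, nrows=3)
df_sample["split"] = "train"
```

For the real file, the same call would take the path `"train.csv"` and `nrows=4000000`.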

In [None]:
# Loading dataset
df_test = pd.read_csv("test.csv", delimiter=',', index_col=False)
df_test['split'] = "test"
df_test.head()

Unnamed: 0,user_id,anime_id,split
0,40763,21405,test
1,68791,10504,test
2,40487,1281,test
3,55290,165,test
4,72323,11111,test


In [None]:
# Loading dataset
df_anime = pd.read_csv("anime.csv", delimiter=',', index_col=False)
df_anime.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [None]:
# Concatenate the DataFrames
df_combined = pd.concat([df_train, df_test], ignore_index=True)
df_combined.head()

Unnamed: 0,user_id,anime_id,rating,split
0,1,11617,10.0,train
1,1,11757,10.0,train
2,1,15451,10.0,train
3,2,11771,10.0,train
4,3,20,8.0,train


In [None]:
df_combined.tail()

Unnamed: 0,user_id,anime_id,rating,split
4633681,7345,2768,,test
4633682,26511,6351,,test
4633683,18270,2369,,test
4633684,27989,20507,,test
4633685,42114,9756,,test


In [None]:
# open up memory - drop the df
del df_test

In [None]:
# open up memory - drop the df
del df_train

---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---

In [None]:
# Perform left join
df = pd.merge(df_combined, df_anime, how='left', on=['anime_id'])
df.head(2)

Unnamed: 0,user_id,anime_id,rating_x,split,name,genre,type,episodes,rating_y,members
0,1,11617,10.0,train,High School DxD,"Comedy, Demons, Ecchi, Harem, Romance, School",TV,12,7.7,398660.0
1,1,11757,10.0,train,Sword Art Online,"Action, Adventure, Fantasy, Game, Romance",TV,25,7.83,893100.0
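
The merge produces `rating_x` and `rating_y` because both tables contain a `rating` column: the first is the user's rating, the second the anime's average rating. A possible way to make this explicit is sketched below on toy frames built from the sample rows above. The suffix names `_user` and `_anime` are assumptions, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-ins for df_combined and df_anime (values copied from the sample rows)
user_ratings = pd.DataFrame({"user_id": [1, 1], "anime_id": [11617, 11757],
                             "rating": [10.0, 10.0]})
anime = pd.DataFrame({"anime_id": [11617, 11757],
                      "name": ["High School DxD", "Sword Art Online"],
                      "rating": [7.70, 7.83]})

# suffixes names the two rating columns explicitly; validate raises an error
# if anime_id is not unique in the right-hand table
merged = pd.merge(user_ratings, anime, how="left", on="anime_id",
                  suffixes=("_user", "_anime"), validate="many_to_one")
```

With `validate="many_to_one"`, a duplicated `anime_id` in the anime table fails loudly instead of silently multiplying rating rows.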


In [None]:
# open up memory - drop the df
del df_combined

In [None]:
# Convert all string type columns to lowercase
df = df.apply(lambda x: x.str.lower() if x.dtype == "object" else x)

In [None]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [None]:
def remove_punctuation(post):
    """Return post as a string with all punctuation characters removed."""
    # Punctuation to strip, including curly quotes and apostrophes
    custom_punctuation = set(string.punctuation + "’‘“”")
    # Treat missing values (NaN/None) as an empty string
    if pd.isna(post):
        return ''
    post = str(post)
    # Keep only the characters that are not punctuation
    return ''.join(ch for ch in post if ch not in custom_punctuation)

In [None]:
df['cleaned_name'] = df['name'].apply(remove_punctuation)

In [None]:
df.head()

Unnamed: 0,user_id,anime_id,rating_x,split,name,genre,type,episodes,rating_y,members,cleaned_name
0,1,11617,10.0,train,high school dxd,"comedy, demons, ecchi, harem, romance, school",tv,12,7.7,398660.0,high school dxd
1,1,11757,10.0,train,sword art online,"action, adventure, fantasy, game, romance",tv,25,7.83,893100.0,sword art online
2,1,15451,10.0,train,high school dxd new,"action, comedy, demons, ecchi, harem, romance,...",tv,12,7.87,266657.0,high school dxd new
3,2,11771,10.0,train,kuroko no basket,"comedy, school, shounen, sports",tv,25,8.46,338315.0,kuroko no basket
4,3,20,8.0,train,naruto,"action, comedy, martial arts, shounen, super p...",tv,220,7.81,683297.0,naruto


In [None]:
# Step 1: Split the 'genre' column into a list of genres
df['genre'] = df['genre'].str.split(', ')

In [None]:
# Step 2: Explode the lists so that each row holds a single genre
df_exploded_genres = df.explode('genre')

In [None]:
df_exploded_genres

In [None]:
# Step 3: One-hot encode the genres
df_dummies = pd.get_dummies(df_exploded_genres['genre'], prefix='genre')

# Step 4: Group by the original DataFrame index and sum only the one-hot columns
# (summing the full joined frame would also add up user_id, ratings, etc.)
df_genre_flags = df_dummies.groupby(level=0).sum()

# Join the encoded genres back onto the original DataFrame, then free memory.
# df must still exist for the join, so it is deleted only afterwards.
df_final = df.join(df_genre_flags)
del df, df_exploded_genres, df_dummies

MemoryError: Unable to allocate 4.19 GiB for an array with shape (43, 104603154) and data type bool
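
The error arises because the one-hot matrix is built over the exploded ratings table (about 104.6 million rows by 43 genre columns). One possible workaround, sketched here on toy data and not tested on the full dataset, is to encode genres once per anime on the much smaller anime table with `Series.str.get_dummies`, and merge the result onto ratings by `anime_id` only when needed. The `anime_id` 99999 is a hypothetical row added to show how a missing genre behaves.

```python
import pandas as pd

# Toy stand-in for df_anime; the real table has one row per anime
anime = pd.DataFrame({
    "anime_id": [9253, 32281, 99999],
    "genre": ["Sci-Fi, Thriller", "Drama, Romance, School, Supernatural", None],
})

# One 0/1 column per genre, built directly from the comma-separated strings;
# missing genres become all-zero rows
genre_flags = (anime["genre"].fillna("").str.lower()
               .str.get_dummies(sep=", ")
               .add_prefix("genre_")
               .astype("uint8"))

# One row per anime, keyed by anime_id, ready to merge onto ratings when needed
anime_genres = pd.concat([anime[["anime_id"]], genre_flags], axis=1)
```

Because the encoded table has one row per anime and not one per rating, its size does not grow with the number of ratings.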

---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---
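
As one illustration of a model parameter choice, the number of clusters for K-Means can be compared using the silhouette score, which is already imported above. The sketch below is an assumption about how this could be done, not the notebook's chosen method. It uses synthetic blobs as a stand-in for scaled user profiles, and the candidate range of `k` is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled user profiles: 3 well-separated groups
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=42)

# Fit K-Means for several candidate k and record the silhouette score of each
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is one reasonable candidate
best_k = max(scores, key=scores.get)
```

On real user profiles the scores are usually much closer together, so the result is a starting point for inspection and not a final answer.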


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
* **Details:** Present metrics used to evaluate the models, such as accuracy, precision, recall, F1-score, etc. Discuss validation techniques employed, such as cross-validation or train/test split.
---
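
For rating prediction, a hold-out comparison can be summarised with an error metric. The sketch below uses root mean squared error (RMSE), which is an assumption: this notebook does not name a metric for rating prediction, and RMSE is only one common choice. The hold-out ratings and predictions are made-up values.

```python
import numpy as np
import pandas as pd

# Toy hold-out set: known ratings hidden from training, plus hypothetical predictions
holdout = pd.DataFrame({
    "rating":    [10.0, 8.0, 6.0, 7.0],
    "predicted": [9.0, 8.0, 8.0, 6.0],
})

# RMSE: square root of the mean squared difference between truth and prediction
errors = holdout["rating"] - holdout["predicted"]
rmse = float(np.sqrt(np.mean(errors ** 2)))
```

Lower values mean the predictions sit closer to the held-out ratings, in the same units as the rating scale.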

In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

## Additional Sections to Consider

* ### Appendix:
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.

* ### Contributors:
If this is a group project, list the contributors and their roles or contributions to the project.
