# Group Project

### Project Title: 2403PTDS_Unsupervised_Project
Analysing News Articles Dataset
#### Done By: Classification Project Team (K Ebrahim, J Sithole, J Maleka, S Tlhale, N Mhlophe & M Majola)

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Project Overview</a>

<a href=#one>1. Dataset</a>

<a href=#two>2. Packages</a>

<a href=#three>3. Environment </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Model Building and Evaluation</a>

<a href=#seven>7. MLFlow </a>

<a href=#seven>8. Streamlit </a>

<a href=#eight>9. Conclusion and Future Work</a>

<a href=#nine>10. Team Members</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

* **Project Overview:**

  This project builds a hybrid anime recommendation system that combines collaborative and content-based filtering.
  Using an anime dataset from myanimelist.net, the system predicts user preferences and recommends relevant anime titles. Because the project is framed as unsupervised, a key part of the work is exploring unsupervised techniques, such as user or item clustering, to improve the recommendations.
  The approach mirrors how platforms such as Netflix provide tailored recommendations.

* **Purpose:** 
    * Personalized Recommendations: To provide users with personalized anime recommendations based on their viewing history and preferences.
    * Content Discovery: To facilitate the discovery of new and relevant anime titles that users might not otherwise find.
    * Improved User Experience: To enhance the user experience on anime platforms by streamlining the content selection process.
    * Unsupervised Learning Exploration: To investigate the application of unsupervised learning methods (e.g., clustering, dimensionality reduction)
      for improving recommendation system performance.
    * Data Analysis: To analyze user behavior and preferences related to anime consumption, potentially revealing insights through unsupervised methods.


* **Details:**

  * Dataset: The project will use a dataset from myanimelist.net containing user ratings for a large collection of anime titles. Data preprocessing and feature engineering will be crucial, especially considering the potential for unsupervised learning.
    
  * Collaborative Filtering: This approach will identify users with similar viewing habits. Unsupervised learning can be used here, for instance, to
    cluster users based on their rating patterns before applying collaborative filtering.
      
  * Content-Based Filtering: This approach will analyze anime characteristics (genre, themes, studio, synopsis). Unsupervised learning can be applied to extract relevant features or group similar anime titles.
 
    
  * Hybrid Approach: The project will combine collaborative and content-based filtering, possibly using unsupervised learning results to inform the
    hybrid strategy.
 
    
   * Unsupervised Learning Component: This is the core of the "unsupervised" aspect. Techniques like K-Means clustering for user segmentation, or
    PCA/t-SNE for dimensionality reduction of anime features, will be explored.
 
     
   * Evaluation Metrics: Performance will be evaluated using metrics like precision, recall, F1-score, RMSE, and potentially metrics specific to
     evaluating the quality of clusters or reduced dimensions.
 
     
   * Technology Stack: Python with libraries like Pandas, NumPy, Scikit-learn, Surprise, and potentially others for visualization or specific
      unsupervised learning algorithms.

    **1. Objective:**

    * Data Preprocessing & Feature Engineering: Clean the data, handle missing values, and engineer features suitable for both supervised and
     unsupervised learning.

    * Unsupervised Learning Implementation: Apply unsupervised learning techniques (clustering, dimensionality reduction) to user ratings and/or anime features.
 
    * Model Development: Implement collaborative, content-based, and hybrid recommendation models, integrating results from unsupervised learning.
 
    * Performance Evaluation: Evaluate the performance of the models, focusing on the impact of the unsupervised learning component.
 
    * Optimization: Fine-tune models and unsupervised learning parameters to maximize recommendation accuracy.
 
    * Recommendation Generation: Develop a system to generate personalized recommendations.
 
    * Documentation & Reporting: Document the project thoroughly, including the rationale behind unsupervised learning choices, implementation details, evaluation results, and insights.

    **2. Deliverables:**

    * Preprocessed Dataset: Cleaned and prepared dataset, including any features created specifically for unsupervised learning.
    * Unsupervised Learning Models: Code implementing the chosen unsupervised learning techniques (e.g., clustering, dimensionality reduction).
    * Recommendation Models: Code implementing the collaborative, content-based, and hybrid recommendation models.
    * Evaluation Report: A report detailing the evaluation metrics and a comparative analysis of the different approaches, emphasizing the contribution of unsupervised learning.
    * Recommendation System: A working system (script, API, or simple application) to generate recommendations.
    * Code Repository: A well-organized and documented repository containing all code, data, and reports.


    **3. Expected Outcomes:**

    * Functional Recommendation System: A working anime recommendation system that provides personalized recommendations.
    * Demonstrated Use of Unsupervised Learning: A clear demonstration of how unsupervised learning techniques can be applied to enhance a
      recommendation system.
    * Performance Analysis: A thorough analysis of the impact of unsupervised learning on recommendation performance.
    * Insights: Potential insights gained from the data through unsupervised learning, such as user segments or anime feature relationships.
    * Documented Project: A well-documented and reproducible project.
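
The unsupervised component described above (K-Means for user segmentation, PCA for dimensionality reduction, and a cluster-quality metric) can be sketched as follows. This is a minimal illustration, not the project's implementation: the toy ratings, the zero-fill for unrated titles, and the choice of two components and two clusters are all assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Toy ratings in a (user_id, anime_id, rating) layout.
# The values are invented for illustration only.
ratings = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "anime_id": [10, 20, 30, 10, 20, 30, 40, 50, 60, 40, 50, 60],
    "rating":   [9, 10, 8, 10, 9, 9, 8, 9, 10, 9, 8, 9],
})

# Pivot to a user x anime matrix; unrated titles are filled with 0 (an assumption).
user_item = ratings.pivot_table(
    index="user_id", columns="anime_id", values="rating", fill_value=0
)

# Reduce the rating vectors to two components before clustering.
pca = PCA(n_components=2, random_state=42)
embedded = pca.fit_transform(user_item.to_numpy(dtype=float))

# Segment users with K-Means (two clusters is an arbitrary choice here).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
segments = kmeans.fit_predict(embedded)

# The silhouette score is one way to judge cluster quality (closer to 1 is better).
score = silhouette_score(embedded, segments)

user_segments = pd.Series(segments, index=user_item.index, name="segment")
print(user_segments)
print(f"Silhouette score: {score:.3f}")
```

On real data the number of clusters would normally be chosen by comparing silhouette scores (or a similar metric) across several values of `n_clusters`.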

---
<a id="one"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

The dataset consists of news articles to be classified by content into five categories: Business, Technology, Sports, Education and Entertainment. It is provided as two files, `train.csv` and `test.csv`.

**Dataset Features:**

| Feature | Description |
| --- | --- |
| **Headlines** | The headline or title of the news article. |
| **Description** | A brief summary or description of the news article. |
| **Content** | The full text content of the news article. |
| **URL** | The URL link to the original source of the news article. |
| **Category** | The category or topic of the news article (e.g., business, education, entertainment, sports, technology). |

* **Purpose:** Describe how the data was collected and provide an overview of its characteristics.
* **Details:** Mention sources of the data, the methods used for collection (e.g., APIs, web scraping, datasets from repositories), and a general description of the dataset including size, scope, and types of data available (e.g., numerical, categorical).
---

<a id="two"></a>
## **Installing and Importing Packages**

**Purpose:** Set up the Python environment with necessary libraries and tools.

**Details:** List and import all the Python packages that will be used throughout the project such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, scikit-learn for modeling, etc.

In [2]:
# Uncomment the next line to install required packages
#!pip install pandas matplotlib seaborn numpy scikit-learn nltk wordcloud textblob joblib mlflow ipython


In [3]:

# Data manipulation and numerical computing
import pandas as pd
import numpy as np

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Text preprocessing
import re      # regular expressions for searching and cleaning strings
import string  # constants such as string.punctuation
from collections import Counter  # frequency counts (standard library)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

# Feature extraction, models and evaluation
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF text features
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Model persistence and experiment tracking
import joblib
import mlflow
import mlflow.pyfunc

# Notebook utilities
from IPython.display import display
import warnings

# Hide all future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

In [5]:
# Load the datasets
df_train = pd.read_csv('train.zip')
df = pd.read_csv('anime.csv')
df_test = pd.read_csv('test.csv')
df_submit = pd.read_csv('submission.csv')

# Display the first few rows of the train dataset
print(df_train.head())

# Display the first few rows of the test dataset
print(df_test.head())

   user_id  anime_id  rating
0        1     11617      10
1        1     11757      10
2        1     15451      10
3        2     11771      10
4        3        20       8
   user_id  anime_id
0    40763     21405
1    68791     10504
2    40487      1281
3    55290       165
4    72323     11111


---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---
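The cleaning steps listed above (missing values, out-of-range values, filtering) could look roughly like this for a ratings table. It is a sketch under stated assumptions: the toy rows, the 1 to 10 rating scale, and the minimum of two ratings per user are illustrative choices, and `clean_ratings` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy ratings containing a duplicate row, a missing rating and an
# out-of-range rating. Values are invented for illustration.
raw = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3, 3],
    "anime_id": [10, 10, 20, 10, 30, 20, 30],
    "rating":   [9.0, 9.0, 8.0, np.nan, 7.0, 15.0, 6.0],
})


def clean_ratings(df, min_rating=1, max_rating=10, min_ratings_per_user=2):
    """Return a cleaned copy of a (user_id, anime_id, rating) table."""
    # Keep one row per user/anime pair.
    cleaned = df.drop_duplicates(subset=["user_id", "anime_id"])
    # Drop rows with no rating.
    cleaned = cleaned.dropna(subset=["rating"])
    # Keep only ratings inside the assumed valid scale.
    cleaned = cleaned[cleaned["rating"].between(min_rating, max_rating)]
    # Filter out users with too few remaining ratings.
    counts = cleaned.groupby("user_id")["rating"].transform("count")
    cleaned = cleaned[counts >= min_ratings_per_user]
    return cleaned.reset_index(drop=True)


clean = clean_ratings(raw)
print(clean)
```

The thresholds are parameters so that they can be tuned once the real rating distribution has been inspected.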

---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---


In [2]:
# Check for missing values in the train dataset

# Check for missing values in the test dataset
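
One possible way to fill in the missing-value checks is a small helper that reports the count and percentage of missing values per column. The frames below are toy stand-ins for the train and test sets, and `missing_report` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return the count and percentage of missing values for each column."""
    counts = df.isna().sum()
    percent = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})


# Toy stand-ins for the train and test sets (values invented).
train = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "anime_id": [20, 20, 30, 40],
    "rating": [8, np.nan, 10, 7],
})
test = pd.DataFrame({"user_id": [5, 6], "anime_id": [20, 30]})

train_report = missing_report(train)
test_report = missing_report(test)
print(train_report)
print(test_report)
```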


In [3]:
# Basic statistics of the train and test dataset
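
A minimal sketch of basic statistics for a ratings table follows. The five rows reuse the training rows printed in the loading step; the sparsity measure (share of user/anime pairs with no rating) is an added illustration, not something the notebook computes.

```python
import pandas as pd

# The first five training rows shown in the loading step.
train = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 3],
    "anime_id": [11617, 11757, 15451, 11771, 20],
    "rating":   [10, 10, 10, 10, 8],
})

# Summary statistics of the rating column (count, mean, std, quartiles).
summary = train["rating"].describe()

# Number of distinct users and titles.
n_users = train["user_id"].nunique()
n_items = train["anime_id"].nunique()

# Sparsity: fraction of the user x anime grid that has no rating.
sparsity = 1 - len(train) / (n_users * n_items)

# How often each rating value occurs.
rating_counts = train["rating"].value_counts().sort_index()

print(summary)
print(f"users={n_users}, titles={n_items}, sparsity={sparsity:.3f}")
print(rating_counts)
```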


---
<a id="six"></a>
## **Model Building and Evaluation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Build, train and evaluate the models.
* **Details:** Describe the choice of models and features, show how the models are trained, and report evaluation metrics (e.g., accuracy, precision, recall, F1-score) together with the validation approach, such as cross-validation or a train/test split.
---


**1. Text Preprocessing:**

* Replace the __Content__ column with the __Cleaned_Content__ column in `df_train`.
* Drop the __Category__ column in `df_test`.

In [4]:
# Tokenization and Vectorization using TF-IDF


In [5]:
# Data splitting


## 4. Final Model Evaluation


In [6]:
# Perform cross-validation on the final SVM model
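
Cross-validating an SVM text classifier might look like the sketch below. The texts, labels, linear kernel, three folds and macro F1 scoring are all assumptions; the vectorizer sits inside the pipeline so that each fold fits TF-IDF on its own training split only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus (invented): six business and six sports headlines.
texts = [
    "Stocks rally as markets recover",
    "Central bank raises interest rates",
    "Company profits beat market forecasts",
    "Investors buy shares after earnings report",
    "Oil prices fall on weak demand",
    "Retail sales grow as inflation slows",
    "Striker scores twice in cup final",
    "Team wins league title after late goal",
    "Coach names squad for the world cup",
    "Sprinter breaks national record",
    "Tennis star reaches the final",
    "Injury rules captain out of the match",
]
labels = ["business"] * 6 + ["sports"] * 6

# TF-IDF and the classifier are fitted together inside each fold.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1.0))

# Stratified folds keep the class balance the same in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1_macro")

print(scores)
print(f"Mean macro F1: {scores.mean():.3f}")
```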


In [7]:
# Evaluate the final model on the test set
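
The test-set evaluation can be sketched with the metric functions imported earlier. The true and predicted labels below are invented; in the notebook they would come from the held-out data and the fitted model.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Invented labels standing in for the held-out targets and model predictions.
y_true = ["business", "sports", "sports", "technology", "business", "technology"]
y_pred = ["business", "sports", "business", "technology", "business", "sports"]
class_names = ["business", "sports", "technology"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging weights every class equally, whatever its frequency.
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Rows are true classes, columns are predicted classes, in class_names order.
cm = confusion_matrix(y_true, y_pred, labels=class_names)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(cm)
```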



In [8]:
# Save the final model for deployment
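
One way to save the model is to persist the whole fitted pipeline (vectorizer plus classifier) as a single file with `joblib`. The toy training data, the temporary directory and the file name `final_model.joblib` are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy training data (invented) so the sketch is self-contained.
texts = [
    "stocks and markets rally",
    "bank shares and markets fall",
    "team wins the football final",
    "striker scores in football match",
]
labels = ["business", "business", "sports", "sports"]

# Fit vectorizer and classifier as one object so they are saved together.
pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
pipeline.fit(texts, labels)

# A temporary directory is used here; a real project would pick a fixed path.
output_dir = Path(tempfile.mkdtemp())
model_path = output_dir / "final_model.joblib"  # hypothetical file name
joblib.dump(pipeline, model_path)

print(f"Saved model to {model_path}")
```

Saving the pipeline as one artifact avoids a vectorizer and model from different training runs being paired by mistake.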


# Deploy the Model

## a. Set Up the Environment
Ensure that the deployment environment has the necessary dependencies and configurations. This includes:

- Python environment (e.g., virtualenv or conda)
- Required libraries (e.g., scikit-learn, joblib, pandas, numpy)

You can create a `requirements.txt` file to list all the dependencies:
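
The file itself is not shown here. As one option, it can be generated from the current environment so that the pinned versions match what the model was trained with. The package list below mirrors the libraries named above; writing to a temporary directory and the helper name `build_requirements` are assumptions of this sketch.

```python
import tempfile
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path

# Libraries named in the deployment notes above.
PACKAGES = ["scikit-learn", "joblib", "pandas", "numpy"]


def build_requirements(packages):
    """Return requirement lines, pinned to installed versions where available."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Package not installed here: leave it unpinned.
            lines.append(name)
    return lines


requirements = build_requirements(PACKAGES)

# Written to a temporary directory here; a project would use its root folder.
target = Path(tempfile.mkdtemp()) / "requirements.txt"
target.write_text("\n".join(requirements) + "\n", encoding="utf-8")

print(target.read_text(encoding="utf-8"))
```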

In [9]:
# Load the model and vectorizer.
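
Loading a separately saved model and vectorizer could be sketched as below. To keep the block runnable on its own, it first fits and saves toy artifacts; the file names and the `predict_category` helper are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Setup for the sketch only: fit and save toy artifacts so that the
# loading step below has something to read. Data and names are invented.
texts = [
    "stocks and markets rally",
    "bank shares and markets fall",
    "team wins the football final",
    "striker scores in football match",
]
labels = ["business", "business", "sports", "sports"]
vectorizer = TfidfVectorizer()
model = SVC(kernel="linear").fit(vectorizer.fit_transform(texts), labels)
artifact_dir = Path(tempfile.mkdtemp())
joblib.dump(vectorizer, artifact_dir / "vectorizer.joblib")
joblib.dump(model, artifact_dir / "model.joblib")

# Load the model and vectorizer, as a deployment script would.
loaded_vectorizer = joblib.load(artifact_dir / "vectorizer.joblib")
loaded_model = joblib.load(artifact_dir / "model.joblib")


def predict_category(text):
    """Vectorize one raw text with the loaded vectorizer and classify it."""
    features = loaded_vectorizer.transform([text])
    return loaded_model.predict(features)[0]


prediction = predict_category("markets and stocks")
print(prediction)
```

The loaded vectorizer must be the one fitted during training; calling `fit_transform` again at prediction time would change the feature space.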


### MLFLOW

---
<a id="eight"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---


## Additional Sections to Consider

* ### Appendix: 
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.

* ### Collaborators: 
  - Emmanuel Majola
  - Selogile Tlhale
  - Josia Sithole
  - Kyle Ebrahim
  - Jerry Maleka
  - Nhlokomo Mhlophe
