# F20DL Group 17 ML Portfolio

# Week 1
Overview: We shortlisted datasets to work on for the rest of the semester.

## Short-listed Tabular Datasets:
1. **Pokémon for Data Mining and Machine Learning** - [Kaggle](https://www.kaggle.com/datasets/alopez247/pokemon)
    - <font color='#90ee90'> 721 entries and 23 attributes for each entry, a mix of nominal and numerical data. </font>
    - <font color='#90ee90'> Only 2 attributes have more than 50% null values; these can be cleaned easily, and plenty of other attributes remain. </font>
    - <font color='#90ee90'> The dataset is 732 KB, so it does not use a lot of space. </font>
   <br></br>
2. **Netflix Movies and TV Shows** - [Kaggle](https://www.kaggle.com/datasets/shivamb/netflix-shows)
    - <font color='#90ee90'> 8807 records with 12 attributes covering all data types (nominal, interval, ratio and ordinal data)</font>
    - <font color='#90ee90'> The dataset is 3.4MB.</font>
    - <font color='#FF7F7F'> 30% of records have null values for a certain attribute - field can be removed or records can be removed (leaving 6000 records) </font>
    <br></br>
3. **Video Game Sales** - [Kaggle](https://www.kaggle.com/datasets/gregorut/videogamesales)
    - <font color='#90ee90'>The dataset is comprehensive, consisting of 16500+ records</font>
    - <font color='#90ee90'>This is a well known dataset with lots of papers and code</font>
    - <font color='#FF7F7F'>Has a limited number of attributes</font>

## Short-listed Computer Vision Datasets

4. **Fruits 360** - [Kaggle](https://www.kaggle.com/datasets/moltean/fruits)
    - <font color='#90ee90'>The dataset is comprehensive, consisting of 90000+ high-quality images across more than 100 classes</font>
    - <font color='#90ee90'>The dataset consists of good-quality, bad-quality, and mixed-quality fruit images</font>
    - <font color='#90ee90'>This is a well known dataset with lots of papers and code</font>
    - <font color='#90ee90'>The dataset has lots of training data which might result in better accuracy</font>
    - <font color='#FF7F7F'>The data might require GPUs for training due to the sheer size of the dataset</font>
    <br></br>
5. **Pokemon Image Dataset** - [Kaggle](https://www.kaggle.com/datasets/vishalsubbiah/pokemon-images-and-types)
    - <font color='#90ee90'>Has images of *all* of the Pokemon from generation 1 to 7</font>
    - <font color='#90ee90'>810 files/images to identify the next evolution from the pre-evolved forms of the current Pokemon</font>
    - <font color='#90ee90'>Each Pokemon has two types, primary and secondary. The dataset helps predict the current type of the Pokemon image</font>
    - <font color='#FF7F7F'>Only 3 columns in the dataset, Pokemon, Type1, Type2</font>
    - <font color='#FF7F7F'>Type 2 has 50% null values, meaning half the Pokemon only have Type1, which makes the identification pointless</font>
    - <font color='#FF7F7F'>Data is not uniform: the image resolutions differ, which can cause conflicts during data analysis</font>


## Selected Dataset:
1. **Pokémon for Data Mining and Machine Learning** - [Kaggle](https://www.kaggle.com/datasets/alopez247/pokemon)
    - For nominal analysis
2. **Fruits 360** - [Kaggle](https://www.kaggle.com/datasets/moltean/fruits)
    - For any CNN related tasks.

# Week 2
Overview: We visualised and summarised the Pokémon data in order to become acquainted with it.

## 1. Imported required packages

In [None]:
# Some required imports
import sys
assert sys.version_info >= (3,5)    # Python >= 3.5 is required

In [None]:
import sklearn
assert sklearn.__version__ >= "0.2"

# Common imports
import numpy as np
import os
import tarfile
import urllib
import pandas as pd
import urllib.request

# To plot pretty figures
%matplotlib inline 
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

## 2. We loaded the dataset, described the attributes, and generated summary statistics.

In [None]:
# Reads the CSV
pokemon = pd.read_csv('pokemon_alopez247.csv')
# This displays the top 5 entries, showing its 23 attributes
pokemon.head()

### Attribute Description of the Dataset

This dataset includes 721 Pokémon records from the first six generations, with 23 attributes.

- **Number**: Unique identifier.
- **Name**: Pokémon name.
- **Type_1**: Primary type.
- **Type_2**: Second type, in case the Pokémon has it.
- **Total**: Sum of base stats (health points, attack, defense, special attack, special defense, and speed).
- **HP**: Base health points.
- **Attack**: Base attack.
- **Defense**: Base defense.
- **Sp_Atk**: Base special attack.
- **Sp_Def**: Base special defense.
- **Speed**: Base speed.
- **Generation**: Generation when the Pokémon was introduced. Ranges from 1 to 6.
- **isLegendary**: Boolean that indicates whether the Pokémon is Legendary or not.
- **Color**: Color of the Pokémon.
- **hasGender**: Boolean that indicates if the Pokémon can be classified as female or male.
- **Pr_Male**: If the Pokémon has a gender, the probability of being male. The probability of being female is 1 minus this value.
- **Egg_Group_1**: Egg group of the Pokémon.
- **Egg_Group_2**: Second egg group of the Pokémon, if it has two.
- **hasMegaEvolution**: If the Pokémon is able to Mega-evolve or not. Boolean value.
- **Height_m**: Pokémon height (m)
- **Weight_kg**: Pokémon weight (kg)
- **Catch_Rate**: Probability of the Pokémon being caught when a Pokéball is thrown at it.
- **Body_Style**: Body style of the Pokémon. E.g., Quadruped.

In [None]:
# Displays the type and number of non-null values of each attribute
# Note: Almost half of the Type_2 values are null, because Type_2 is optional: not every Pokémon has a second type.
pokemon.info()


In [None]:
# Generates summary statistics for the numerical attributes of the dataset
# The dataset is not complete because there are some null values, which are dealt with in the next step.
pokemon.describe()

In [None]:
# Indicates where the null values are, within the dataset.
sns.heatmap(pokemon.isnull(), cbar=False)

## 3. Dealt with null values

We replaced the null values with placeholder values rather than dropping them, because the nulls are themselves meaningful: they indicate that the attribute does not apply to that Pokemon.

Attributes with nulls, and why they have nulls:

1. *Type_2* has around 50% null values, as some Pokemon do not have a second type. Removing all the rows where it is null would reduce our dataset to 50% of its size. Removing the column would end the possibility of identifying and analysing a Pokemon's second type. Replacing the nulls with the string "None" solves our problem.

2. *Egg_Group_2* has around 75% null values, as some Pokemon have only one egg group. Removing all the rows where it is null would reduce our dataset to 25% of its size. Removing the column would again end the possibility of identifying and analysing a Pokemon's second egg group. Replacing the nulls with the string "None" solves our problem.

3. *Pr_Male* has around 11% null values; these Pokemon do not have a gender. Removing these rows, or the column, would cost us the possibility of identifying and predicting a Pokemon's gender. Replacing the nulls with 999, a value outside the valid probability range of 0 to 1, marks the Pokemon as genderless.
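
The trade-off between the three options can be seen on a small made-up table. This is only a sketch: the `toy` frame and its values are invented for illustration and are not taken from the Pokémon dataset.

```python
import numpy as np
import pandas as pd

# Toy data (invented): a null means "this attribute does not apply"
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Type_2": ["Flying", None, None, "Poison"],
    "Pr_Male": [0.5, np.nan, 0.875, 0.5],
})

# Option 1: drop every row that contains a null (loses records)
dropped_rows = toy.dropna()

# Option 2: drop every column that contains a null (loses attributes)
dropped_cols = toy.dropna(axis=1)

# Option 3: fill the nulls with placeholders (keeps all rows and columns)
filled = toy.fillna({"Type_2": "None", "Pr_Male": 999})

print("rows before:", len(toy), "rows after dropping:", len(dropped_rows))
print("columns left after dropping:", list(dropped_cols.columns))
print("nulls left after filling:", int(filled.isnull().sum().sum()))
```

One thing to keep in mind with the third option is that a numeric placeholder such as 999 will take part in summary statistics and scaling like any other number.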

In [None]:
# dealing with null values 

# changing null values of Type_2 to string "None"
# (assigning the result back avoids the chained inplace call, which newer pandas versions warn about)
pokemon['Type_2'] = pokemon['Type_2'].fillna("None")
# changing null values of Egg_Group_2 to string "None"
pokemon['Egg_Group_2'] = pokemon['Egg_Group_2'].fillna("None")
# changing null values of Pr_Male to 999
pokemon['Pr_Male'] = pokemon['Pr_Male'].fillna(999)

# Now there are no more null values.
pokemon.info()

## 4. Visualising again, now with a complete dataset

In [None]:
# Generating a box plot to view the numerical data

plt.figure(figsize = (30, 10))
# Taking all numerical data to plot
num_data = pokemon[['Total', 'HP', 'Attack', 'Defense', 'Sp_Atk', 'Sp_Def', 'Speed', 'Generation', 'Height_m', 'Weight_kg', 'Catch_Rate']]
# Generating a box plot to visualize the statistical summary 
sns.boxplot(data = num_data)

## 5. Dealing with Categorical Data

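We will convert the categorical data into numerical data, because scikit-learn estimators (such as the logistic regression used in Week 3) and `pokemon.corr()` operate on numeric values.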
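
As a minimal sketch of what an ordinal encoding does, the example below encodes a made-up `Color` column (the values are invented, not read from the dataset). Each distinct category is replaced by an integer code, and the mapping can be inspected or reversed afterwards.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column (invented values)
toy = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

enc_demo = OrdinalEncoder()
codes = enc_demo.fit_transform(toy[["Color"]]).astype(int)

# The learned categories are sorted; a category's position is its code
print(enc_demo.categories_[0].tolist())
print(codes.ravel().tolist())

# inverse_transform maps the codes back to the original labels
restored = enc_demo.inverse_transform(codes)
print(restored.ravel().tolist())
```

Note that the resulting integers imply an order (here Blue < Green < Red) that has no real meaning for nominal attributes, which is worth remembering when interpreting correlations or linear-model weights.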


In [None]:
# The data before:
categorical_attributes = ['Type_1', 'Type_2', 'Egg_Group_1', 'Egg_Group_2', 'Color', 'Body_Style', 'isLegendary', 'hasMegaEvolution', 'hasGender']
pokemon[categorical_attributes].head()

In [None]:
# Changing the categorical data to numbers.
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
pokemon[categorical_attributes] = enc.fit_transform(pokemon[categorical_attributes]).astype(int)

In [None]:
# Now they are all numerical
pokemon.info()

In [None]:
# The data after:
pokemon[categorical_attributes].head()

## 6. Dropping Unnecessary Features

In [None]:
# Number and name are unnecessary because they will not help us classify the target attributes
pokemon = pokemon.drop(['Number','Name'], axis=1)
pokemon.head()

## 7. Plotting the Correlation Matrix
We want to see the correlation between each pair of attributes.
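
A correlation matrix is symmetric, so a heatmap only needs one triangle of it. The sketch below builds a boolean mask for the upper triangle on random toy data (the frame `toy` and its columns are made up); passing such a mask to `sns.heatmap(corr, mask=mask)` hides the mirrored half.

```python
import numpy as np
import pandas as pd

# Toy numeric data (random, for illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
corr_demo = toy.corr()

# True marks a cell to hide: everything strictly above the diagonal
mask_demo = np.triu(np.ones(corr_demo.shape, dtype=bool), k=1)
print(mask_demo)
```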

In [None]:
# We want to look at the correlations between each pair of attributes

corr = pokemon.corr()
# Hide the upper triangle (above the diagonal), because it mirrors the lower one
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

fig, ax = plt.subplots()
fig.set_size_inches(20,10)
sns.heatmap(corr, mask = mask, annot=True, cmap='viridis', linewidths=.5)

# Week 3:

#### Binary Classification Function
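
The function in this section splits the data, standardises it, fits a logistic regression and reports the confusion-matrix counts. As a hedged sketch of the same workflow, the example below uses a scikit-learn pipeline on synthetic data from `make_classification` (a stand-in, not the Pokémon data). The pipeline learns the scaling statistics from the training split only and reuses them on the test split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Pokémon dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=10)

# fit() fits the scaler on the training split; predict() only transforms the test split
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
demo_accuracy = (tp + tn) / (tp + fp + tn + fn)
print('TP =', tp, 'FP =', fp, 'TN =', tn, 'FN =', fn)
print('Accuracy = {:0.3f}'.format(demo_accuracy))
```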

In [None]:
from sklearn.preprocessing import StandardScaler
# Logistic regression to the training data
from sklearn.linear_model import LogisticRegression
# Creates the confusion matrix
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def binaryClassification(X, y):
    """Fit a logistic regression on a 70/30 split and print the confusion-matrix counts and accuracy."""

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

    # Fit the scaler on the training data only, then reuse it for the test data
    ss_train_test = StandardScaler()
    logisticRegr = LogisticRegression()
    logisticRegr.fit(ss_train_test.fit_transform(X_train), y_train)

    # Predicting based off of the test data (transform only, so the scaler is not refitted on it)
    predictions = logisticRegr.predict(ss_train_test.transform(X_test))

    cm = confusion_matrix(y_test, predictions)

    TN, FP, FN, TP = cm.ravel()

    print('True Positive(TP)  = ', TP)
    print('False Positive(FP) = ', FP)
    print('True Negative(TN)  = ', TN)
    print('False Negative(FN) = ', FN)

    # Accuracy of the classifier
    accuracy =  (TP + TN) / (TP + FP + TN + FN)

    print('Accuracy of the binary classifier = {:0.3f}'.format(accuracy))


    

In [None]:
# We also tried MinMaxScaler instead of StandardScaler:
# MinMaxScaler gave a lower accuracy and more false positives than StandardScaler.

# from sklearn import preprocessing
# min_max_scaler = preprocessing.MinMaxScaler()

# X_train_Binary = min_max_scaler.fit_transform(X_train)
# X_test_Binary = min_max_scaler.transform(X_test)


### Binary Classification on Original Data (Without Feature Extraction)


In [None]:
# Original data without any feature extraction:
y = pokemon['isLegendary']
# X has every attribute except isLegendary, which is the class attribute (and so should not be normalized)
X = pokemon.drop('isLegendary', axis =1 )

binaryClassification(X, y)

## Pearson's R Feature Filtering
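
Pearson's r measures the strength of the linear relationship between two attributes, from -1 to 1. The sketch below computes it by hand on a made-up table and compares it with `DataFrame.corr()`. The helper `pearson_r` and the toy columns are hypothetical names introduced only for this illustration. It also shows that sorting by the signed value places strongly negative features last, while sorting by absolute value (one common alternative, not what the cells below do) treats them as equally informative.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Toy data (invented): one feature rises with the target, one falls, one is unrelated
toy = pd.DataFrame({
    "target": [0, 0, 0, 1, 1],
    "up":     [1, 2, 3, 8, 9],
    "down":   [9, 8, 7, 2, 1],
    "noise":  [5, 1, 4, 2, 3],
})

corr_with_target = toy.corr()["target"].drop("target")
signed_rank = corr_with_target.sort_values(ascending=False)
abs_rank = corr_with_target.abs().sort_values(ascending=False)

print(signed_rank)
print(abs_rank)
print(pearson_r(toy["up"], toy["target"]))
```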

In [None]:
corr_matrix = pokemon.corr() # computes the standard correlation coefficient (Pearson’s r) between every pair of attributes
# Correlation of every attribute with the class, most positive first
top_corr = corr_matrix["isLegendary"].drop('isLegendary').sort_values(ascending=False)
top_corr

In [None]:
from pandas.plotting import scatter_matrix

# getting top 5 correlating attributes and visualising it
attributes = top_corr.index[:5].tolist()
scatter_matrix(pokemon[attributes], figsize=(12, 8))

In [None]:
# Selecting three groups of 3 features: the top 3, those ranked 4 to 6, and the bottom 3
 

pokemon_pearsons_r = pokemon[top_corr.index[:3].tolist()]
pokemon_pearsons_r_2 = pokemon[top_corr.index[3:6].tolist()]
pokemon_pearsons_r_3 = pokemon[top_corr.index[-3:].tolist()]

print(pokemon_pearsons_r.columns.to_list())
print(pokemon_pearsons_r_2.columns.to_list())
print(pokemon_pearsons_r_3.columns.to_list())

### Binary Classification on Pearson's R Data (1st Feature Extraction)


<h5>Binary Classification Using the Top 3 Features</h5>

In [None]:
# Features selected using Pearson's r
y = pokemon['isLegendary']
# Has only the top 3 features selected using Pearson's r
X = pokemon_pearsons_r
print("features that are being used: ", X.keys().tolist())
print("---------")
binaryClassification(X,y)
#mutual information
#chi square.

<h5>Binary Classification Using the Features Ranked 4 to 6</h5>

In [None]:
# Features selected using Pearson's r
y = pokemon['isLegendary']
# Has only the features ranked 4 to 6 by Pearson's r
X = pokemon_pearsons_r_2
print("features that are being used: ", X.keys().tolist())
print("---------")
binaryClassification(X,y)

<h5>Binary Classification Using the Bottom 3 Features</h5>

In [None]:
# Features selected using Pearson's r
y = pokemon['isLegendary']
# Has only the 3 features with the lowest Pearson's r
X = pokemon_pearsons_r_3

print("features that are being used: ", X.keys().tolist())
print("---------")
binaryClassification(X,y)

## Embedded Methods
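
An embedded method selects features as a side effect of training: Lasso's L1 penalty drives the weights of unhelpful features to exactly zero, and the features with non-zero weights are kept. The sketch below shows this on synthetic data where only `f0` and `f1` influence the target. The data, the column names and the value `alpha=0.1` are all assumptions made for illustration; how many features survive depends strongly on `alpha` and on feature scaling.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (invented): the target depends on f0 and f1 only
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1000, 5)), columns=[f"f{i}" for i in range(5)])
y_demo = (X_demo["f0"] + X_demo["f1"] > 0).astype(int)

# Scale first so the penalty treats every feature equally
X_scaled = StandardScaler().fit_transform(X_demo)

lasso_demo = Lasso(alpha=0.1)  # alpha chosen arbitrarily for this sketch
lasso_demo.fit(X_scaled, y_demo)

# Keep the features whose weight was not shrunk to zero
selected = [name for name, w in zip(X_demo.columns, lasso_demo.coef_) if w != 0]
print("Selected features:", selected)
```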

In [None]:
# This is wrong, this is using the normal data (all attrib in X), instead of actually using the extracted attri in X.

y = pokemon['isLegendary']
# X has every attribute except isLegendary, which is the class attribute (and so should not be normalized)
X = pokemon.drop('isLegendary', axis =1 )

X_train_embedded, X_test_embedded, y_train_embedded, y_test_embedded = train_test_split(X, y, test_size=0.3, random_state=10)


In [None]:
# train model using lasso
from sklearn.linear_model import Lasso
lasso = Lasso()
lasso.fit(X_train_embedded, y_train_embedded)

# perform feature selection
pokemon_embedded_methods = [feature for feature, weight in zip(X.columns.values, lasso.coef_) if weight != 0]
print("Features that have been selected are: ",pokemon_embedded_methods)
print("---------")
binaryClassification(pokemon[pokemon_embedded_methods],y)


#### Applying PCA:
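
PCA replaces the original attributes with uncorrelated components ordered by how much variance each explains. A minimal sketch on synthetic data (four columns built from two underlying signals, all invented here) shows how `explained_variance_ratio_` can be used to decide how many components to keep. The 95% threshold is an assumption for illustration, not a value taken from our experiments.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (invented): two independent signals plus a noisy copy of each
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X_demo = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.05 * rng.normal(size=200),
    base[:, 1] + 0.05 * rng.normal(size=200),
])

# PCA is sensitive to scale, so standardise first
X_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA().fit(X_scaled)

ratios = pca_demo.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative explained variance reaches 95%
n_keep = int(np.argmax(cumulative >= 0.95)) + 1
print(ratios.round(3), n_keep)
```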

In [None]:
from sklearn.decomposition import PCA

y = pokemon['isLegendary']
# X has every attribute except isLegendary, which is the class attribute (and so should not be normalized)
X = pokemon.drop('isLegendary', axis =1 )

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
# Fraction of the total variance explained by each principal component
exp_variance = pca.explained_variance_ratio_
print('Explained variance ratio per component: ', exp_variance)

# Assumed intent: classify on the principal components with the same model as before
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train, y_train)
predictions = logisticRegr.predict(X_test)

cm = confusion_matrix(y_test, predictions)

TN, FP, FN, TP = cm.ravel()

print('True Positive(TP)  = ', TP)
print('False Positive(FP) = ', FP)
print('True Negative(TN)  = ', TN)
print('False Negative(FN) = ', FN)

# Accuracy of the classifier
accuracy =  (TP + TN) / (TP + FP + TN + FN)

print('Accuracy of the binary classifier = {:0.3f}'.format(accuracy))






## Week 3 Conclusions
- What kind of information did you learn, as a result of the above experiments?
- What features are more important/reliable for the class? Less reliable?