# Introduction to Data Visualization and Exploration
# Assignment 1

## Group Members:
#### Khanyisile Sixhaxa: 1590202
#### Tsireledzo Ravelle: 1821249
#### Keletso Pule:
#### Charlotte Savage: 1079415

# 1.1.1 Data cleaning and outliers

The glass identification dataset is an older but still widely used dataset. It contains 214 glass samples, each labelled with one of seven glass types (six of which occur in the data) and described by nine features:
the refractive index and the content in weight percent of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

In [None]:
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as shc
import seaborn as sns

sns.set_theme(style="white")

## Attribute Information
1. Id number: 1 to 214
2. RI: refractive index
3. Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
4. Mg: Magnesium
5. Al: Aluminum
6. Si: Silicon
7. K: Potassium
8. Ca: Calcium
9. Ba: Barium
10. Fe: Iron
11. Type of glass: (class attribute)
  - 1: building_windows_float_processed
  - 2: building_windows_non_float_processed
  - 3: vehicle_windows_float_processed
  - 4: vehicle_windows_non_float_processed (none in this database)
  - 5: containers
  - 6: tableware
  - 7: headlamps

In [None]:
# The column names are passed explicitly via `names`
columns = ['Id', 'RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'GlassType']
df = pd.read_csv('glass.data', names=columns)

In [None]:
df

In [None]:
df.describe()

#### 1. Using visualisations, explore the feature variables to understand their distributions as well as the relationships between predictors. Here, include histograms, bar charts, correlation heatmaps, etc.

In [None]:
#Get the total number of each glass type
print("GlassType   Total")
df['GlassType'].value_counts()

In [None]:
sns.set_theme()  # replaces the "seaborn" matplotlib style, deprecated since matplotlib 3.6
fig, ax = plt.subplots(figsize=(8, 6))
sns.countplot(x=df["GlassType"], ax=ax)
ax.set_title("Glass Types Distribution", fontsize=15, y=1.03)

In [None]:
sns.set_theme()  # replaces the "seaborn" matplotlib style, deprecated since matplotlib 3.6
fig, ax = plt.subplots(figsize=(10, 8))

# Derive the labels from the counts so that they always match the slice order
type_counts = df["GlassType"].value_counts()
ax.pie(x=type_counts,
       labels=[f"Glass Type {t}" for t in type_counts.index],
       shadow=True,
       autopct="%1.2f%%")
ax.set_title("Glass Types Distribution Pie Chart", fontsize=15)
plt.show()

In [None]:
# Plot the distribution of each feature
def plot_distribution(dataset, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    sns.set_style("whitegrid")  # replaces 'seaborn-whitegrid', deprecated since matplotlib 3.6
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(wspace=wspace, hspace=hspace)
    rows = math.ceil(dataset.shape[1] / cols)
    for i, column in enumerate(dataset.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        if dataset[column].dtype == object:  # np.object was removed in NumPy 1.24
            # Categorical column: horizontal bar chart of the category counts
            g = sns.countplot(y=column, data=dataset, ax=ax)
            substrings = [s.get_text()[:18] for s in g.get_yticklabels()]
            g.set(yticklabels=substrings)
        else:
            # Numeric column: histogram with a kernel density estimate
            # (sns.distplot is deprecated in favour of sns.histplot)
            sns.histplot(dataset[column], kde=True, ax=ax)
        ax.tick_params(axis="x", labelrotation=25)

In [None]:
cols_to_plot = ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'GlassType']
plot_distribution(df[cols_to_plot], cols=3, width=20, height=20, hspace=0.45, wspace=0.5)

In [None]:
sns.pairplot(df[['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'GlassType']],hue='GlassType')

In [None]:
features = ['RI','Na','Mg','Al','Si','K','Ca','Ba','Fe']
label = ['GlassType']

In [None]:
plt.figure(figsize=(15, 15))
# Id is only a row identifier, so it is left out of the correlation matrix
sns.heatmap(df.drop(columns="Id").corr(), annot=True)

The diagrams above show that most features are skewed, either positively or negatively, and that the data is not normalized. The heatmap shows that refractive index (RI) and Calcium (Ca) have the strongest correlation, Aluminum (Al) and Barium (Ba) have an intermediate correlation, and Magnesium (Mg) and Barium (Ba) have the lowest coefficient of the three. Note that the strength of a correlation is given by the absolute value of its coefficient, so a strongly negative coefficient indicates a strong inverse relationship rather than a weak one.
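
Reading coefficients off a large heatmap is error-prone, so it can help to list the feature pairs sorted by absolute correlation. The following is a minimal sketch of that idea; the helper name `rank_correlations` and the small `demo` frame are hypothetical and are not part of the glass data.

```python
import numpy as np
import pandas as pd


def rank_correlations(frame: pd.DataFrame) -> pd.DataFrame:
    """Return each feature pair once, sorted by absolute Pearson correlation."""
    corr = frame.corr()
    # Keep only the upper triangle so the diagonal and mirrored pairs are dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna().reset_index()
    pairs.columns = ["feature_a", "feature_b", "r"]
    pairs["abs_r"] = pairs["r"].abs()
    return pairs.sort_values("abs_r", ascending=False).reset_index(drop=True)


# Synthetic example (not the glass data): b rises with a, c falls with a.
demo = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                     "b": [2, 4, 5, 8, 10],
                     "c": [9, 7, 6, 3, 1],
                     "d": [3, 1, 4, 1, 5]})
ranked = rank_correlations(demo)
print(ranked)
```

On the glass data, the same helper could be called as `rank_correlations(df[features])`, assuming `df` and `features` are defined as in the cells above.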

#### 2. Can you find any outliers? Are any of the distributions of the features skewed?

In [None]:
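# One horizontal boxplot per feature (Id and GlassType are not measurements)
for column in df.columns.drop(["Id", "GlassType"]):
    plt.figure()
    df.boxplot(column=column, vert=False)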

The box-and-whisker diagrams show many outliers in the Iron (Fe), Barium (Ba), Calcium (Ca), Potassium (K), Silicon (Si), Aluminum (Al), Sodium (Na) and refractive index (RI) columns, whereas the Magnesium (Mg) column does not show any outliers. The diagrams above also show that the features are skewed, either positively or negatively, and that the data is not normalized:
1. Refractive index (RI): Right-skewed distribution (Positive)
2. Sodium (Na): Right-skewed distribution (Positive)
3. Magnesium (Mg): Left-skewed distribution (Negative)
4. Aluminum (Al): Right-skewed distribution (Positive)
5. Silicon (Si): Left-skewed distribution (Negative)
6. Potassium (K): Right-skewed distribution (Positive)
7. Calcium (Ca): Right-skewed distribution (Positive)
8. Barium (Ba): Right-skewed distribution (Positive)
9. Iron (Fe): Right-skewed distribution (Positive)
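
The skewness directions and outlier counts above can also be checked numerically instead of by eye. The sketch below is one way to do that, assuming the usual boxplot convention that a point is an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The helper name `skew_and_outliers` and the `demo` frame are hypothetical.

```python
import pandas as pd


def skew_and_outliers(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise sample skewness and the number of points outside the 1.5*IQR whiskers."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Flag values beyond the whiskers that a default boxplot would draw.
    outside = (frame < q1 - 1.5 * iqr) | (frame > q3 + 1.5 * iqr)
    return pd.DataFrame({"skewness": frame.skew(), "n_outliers": outside.sum()})


# Synthetic example (not the glass data): 'right' has a long right tail,
# 'left' is its mirror image and 'flat' is symmetric.
demo = pd.DataFrame({"right": [1, 1, 2, 2, 3, 3, 4, 20],
                     "left": [-20, -4, -3, -3, -2, -2, -1, -1],
                     "flat": [1, 2, 3, 4, 5, 6, 7, 8]})
summary = skew_and_outliers(demo)
print(summary)
```

A positive skewness value corresponds to a right-skewed distribution and a negative value to a left-skewed one. On the glass data the call would be `skew_and_outliers(df[features])`, assuming `features` is defined as above.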

#### 3. What types of transformations of one (or more) of these features might improve the classification model?

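Normalization: rescaling the features to a common scale, since the features are skewed and not normalized, as noted above.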
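
A minimal sketch of such a transformation is given below. It standardises every column with scikit-learn's `StandardScaler` (zero mean, unit variance). The optional `log1p` step for right-skewed, non-negative columns is an assumption added here for illustration and is not something the analysis above evaluated. The helper name `transform_features` and the `demo` frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def transform_features(frame: pd.DataFrame, log_cols=()) -> pd.DataFrame:
    """Optionally apply log1p to selected non-negative columns, then standardise all columns."""
    out = frame.astype(float).copy()
    for col in log_cols:
        # log1p compresses a long right tail and is defined at zero.
        out[col] = np.log1p(out[col])
    scaled = StandardScaler().fit_transform(out)
    return pd.DataFrame(scaled, columns=frame.columns, index=frame.index)


# Synthetic example (not the glass data): x has a long right tail.
demo = pd.DataFrame({"x": [0.0, 0.1, 0.2, 0.3, 5.0],
                     "y": [70.0, 71.0, 72.0, 73.0, 74.0]})
scaled = transform_features(demo, log_cols=["x"])
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

In a classification setting the scaler should be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.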

# 1.1.2 Data cleaning and missing values

# 1.1.3 Feature Selection and Engineering

# 1.1.4 Dimensionality Reduction
Use the penguin dataset for this section. For this piece do not worry about cleaning the data or looking for missing values.


In [None]:
%pip install palmerpenguins

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from palmerpenguins import load_penguins

sns.set_style('whitegrid')

In [None]:
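penguin = load_penguins()
# Encode the categorical columns as integers, column by column
penguins = penguin.replace({'species': {'Adelie': 0, 'Gentoo': 1, 'Chinstrap': 2},
                            'island': {'Torgersen': 0, 'Dream': 1, 'Biscoe': 2},
                            'sex': {'female': 0, 'male': 1}})
# Missing values become 0 (no cleaning is required for this section)
penguins = penguins.fillna(0)
penguins.columns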

In [None]:
penguins_data = penguins[['island','bill_length_mm','bill_depth_mm','flipper_length_mm','body_mass_g','sex','year']]
penguins_target = penguins.species

#### 1. Perform PCA with 2 and then 4 components. Show the explained variance for the different PCs.

#### PCA with 2 components

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Standardise the features, then project them onto the first two principal components
pca2 = PCA(n_components=2)
x_pca2 = pca2.fit_transform(scaler.fit_transform(penguins_data))

#### PCA with 4 components

In [None]:
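# raw data: a separate PCA object, so that this fit is not overwritten below
pca4_raw = PCA(n_components=4)
x_pca4_raw = pca4_raw.fit_transform(penguins_data)

# standardised data
pca4 = PCA(n_components=4)
x_pca4_std = pca4.fit_transform(scaler.fit_transform(penguins_data))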

#### Explained Variance

In [None]:
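import numpy as np
import seaborn as sns


def plot_cumul_var(pcamodel):
    """Bar chart of the explained variance per component with the cumulative sum overlaid."""
    components = list(range(1, len(pcamodel.explained_variance_) + 1))
    plt.bar(components, pcamodel.explained_variance_)
    plt.ylabel('Explained variance')
    plt.xlabel('Components')
    plt.plot(components,
             np.cumsum(pcamodel.explained_variance_),
             c='red',
             label="Cumulative Explained Variance")
    plt.legend(loc='upper left')
    plt.show()


def plot_expl_var_ratio(pcamodel):
    """Line plot of the explained variance ratio of each component (not cumulative)."""
    components = list(range(1, len(pcamodel.explained_variance_ratio_) + 1))
    plt.plot(components, pcamodel.explained_variance_ratio_)
    plt.xticks(components)  # the first component is at 1 on the x-axis
    plt.xlabel('principal component')
    plt.ylabel('explained variance ratio')
    plt.show()


def plot_expl_variance(pcamodel):
    """Line plot of the explained variance of each component (not cumulative)."""
    components = list(range(1, len(pcamodel.explained_variance_) + 1))
    plt.plot(components, pcamodel.explained_variance_)
    plt.xticks(components)
    plt.xlabel('principal component')
    plt.ylabel('explained variance')
    plt.show()


def plot_heatmap(pcamodel, columns):
    """Heatmap of the loadings: how much each variable contributes to each component."""
    ax = sns.heatmap(pcamodel.components_,
                     cmap='YlGnBu',
                     yticklabels=["PCA" + str(x) for x in range(1, pcamodel.n_components_ + 1)],
                     xticklabels=columns,
                     cbar_kws={"orientation": "horizontal"})
    ax.set_aspect("equal")
    plt.show()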

#### (i) Explained variance for PCA with 2 components

In [None]:
# plot variance ratio
plot_expl_var_ratio(pca2)
# plot a heatmap showing which variables are contributing to each PC
plot_heatmap(pca2, list(penguins_data.columns))
print('Total Variance Captured by Principal Components: {0}%'.format(pca2.explained_variance_ratio_.sum()*100.))

#### (ii) Explained Variance for PCA with 4 components 

In [None]:
# plot variance ratio
plot_expl_var_ratio(pca4)
# plot a heatmap showing which variables are contributing to each PC
plot_heatmap(pca4, list(penguins_data.columns))
print('Total Variance Captured by Principal Components: {0}%'.format(pca4.explained_variance_ratio_.sum()*100.))
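
Besides the plots, the explained variance of the individual components can be shown as a small table. The sketch below builds such a table from a fitted scikit-learn `PCA` object using its `explained_variance_ratio_` attribute. The helper name `explained_variance_table` is hypothetical, and the data is random synthetic data with seven columns, not the penguin data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def explained_variance_table(model: PCA) -> pd.DataFrame:
    """Per-component and cumulative explained variance ratio of a fitted PCA."""
    ratio = model.explained_variance_ratio_
    index = [f"PC{i}" for i in range(1, len(ratio) + 1)]
    return pd.DataFrame({"ratio": ratio, "cumulative": np.cumsum(ratio)}, index=index)


# Synthetic stand-in for a 7-feature dataset (not the penguin data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)  # make two columns strongly correlated

model = PCA(n_components=4).fit(StandardScaler().fit_transform(X))
table = explained_variance_table(model)
print(table)
```

With the objects from the cells above, `explained_variance_table(pca2)` and `explained_variance_table(pca4)` would give the corresponding tables for the penguin data.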

#### 2. Then, for the PCA with 4 components, make a scatterplot for the first two principal components for a) the raw data and b) standardised data. What do you notice about these different plots?

#### Scatterplot of the first two principal components using the raw data

In [None]:
plt.scatter(x_pca4_raw[:, 0], x_pca4_raw[:, 1], c=penguins.species, alpha=0.8, marker='o')
plt.show()

#### Scatterplot of the first two principal components using the standardised data

In [None]:
plt.scatter(x_pca4_std[:, 0], x_pca4_std[:, 1], c=penguins.species, alpha=0.8, marker='o')
plt.show()

#### What do you notice about these different plots?
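Both fits reduce the 7 input features to 4 principal components; the difference lies in what those components capture. For the raw data, the first principal component is dominated by the feature with the largest variance, body_mass_g (measured in grams), so the scatterplot mostly reflects differences in body mass. After standardisation every feature has unit variance, so the first two components combine information from several features rather than from a single one.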
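
To illustrate how the scale of the features affects PCA, the sketch below uses hypothetical synthetic data with three independent features, one of which is on a much larger scale than the others. It is not the penguin data; it only demonstrates the general effect of standardising before PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: three independent features, the last on a much larger scale.
X = np.column_stack([rng.normal(0, 1, 200),
                     rng.normal(0, 2, 200),
                     rng.normal(0, 800, 200)])

pca_raw = PCA(n_components=2).fit(X)
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Absolute loadings of the first component: which features it is built from.
pc1_raw = np.abs(pca_raw.components_[0])
pc1_std = np.abs(pca_std.components_[0])

print("PC1 loadings, raw:         ", np.round(pc1_raw, 3))
print("PC1 loadings, standardised:", np.round(pc1_std, 3))
print("PC1 variance ratio, raw:   ", round(float(pca_raw.explained_variance_ratio_[0]), 4))
```

On the raw data the first component is almost entirely the large-scale feature, whereas after standardisation no feature dominates purely because of its units.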