# AI Environment Creation and Testing

## Instructions

This is a group assignment centered on showcasing your technical creativity in building a working environment for running AI experiments. The environment is to be created on a single node with the libraries and tools listed below. This simulates an on-premise hardware environment, which will be your own machine. The environment must run on Linux (use any distribution you prefer; those using Windows are expected to virtualize a Linux installation):

1. Latest Anaconda installation.
2. Latest version of Python
3. Jupyter Notebook
4. TensorFlow
5. Keras
6. NumPy
7. SciPy
8. Matplotlib
9. Pandas
10. Scikit-Learn
11. Other.

### After creating your environment, you should be able to provide:

1. A clear demonstration and explanation of the above mentioned libraries and tools through a specific dataset of your choice. You are free to use any datasets within an African context to provide proper data manipulation operations and procedures as a way of demonstrating functionality of your created environment.

2. Necessary comments on your Notebook.

3. Documentation of your work.

## 1. Install Linux Platform

This group assignment simulates an on-premise hardware environment, which will be your machine.
The environment should run specifically on a Linux platform.

In [1]:
# Linux Installation

# lsb_release script gives information about the Linux Standards Base (LSB) status of the distribution
# -d gives the description of this distribution
!lsb_release -d

Description:	Ubuntu 18.04.5 LTS


## 2. Install Required Tools and Libraries

The AI implementation environment is to be created on a single node with the libraries and tools listed below.

### 2.1 Environment checklist

In [2]:
# Latest Anaconda Installation

# conda list will list all packages in the current environment
# to check which anaconda version we have installed, we add anaconda$ that will return only the package named "anaconda"
!conda list anaconda$

/bin/bash: conda: command not found
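
The `command not found` message above only says that the shell could not find a `conda` executable on its `PATH`; on its own it does not show whether Anaconda is installed somewhere else on the machine. As a minimal sketch using only the Python standard library, `shutil.which` can report which of the expected command-line tools are visible to the notebook. The tool list and the helper name `locate_tools` are illustrative assumptions, not part of the original notebook.

```python
import shutil

# Command-line tools this notebook calls with "!" (illustrative list).
TOOLS = ["conda", "python3", "jupyter"]


def locate_tools(names):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


tool_paths = locate_tools(TOOLS)
for name, path in tool_paths.items():
    print(f"{name}: {path if path else 'not found on PATH'}")
```

If `conda` is reported as missing even though Anaconda was installed, one common cause is that its `bin` directory was never added to `PATH` for the shell that launched Jupyter.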


In [3]:
# Latest Version of Python

!python3 -V

Python 3.8.0


In [4]:
# Jupyter Notebook Installation

!jupyter --version

jupyter core     : 4.7.0
jupyter-notebook : 6.2.0
qtconsole        : 5.0.1
ipython          : 7.16.1
ipykernel        : 5.4.3
jupyter client   : 6.1.11
jupyter lab      : not installed
nbconvert        : 6.0.7
ipywidgets       : 7.6.3
nbformat         : 5.1.2
traitlets        : 4.3.3


### 2.2 Libraries checklist

After creating your environment, you should be able to provide:

* A clear demonstration and explanation of the above mentioned libraries and tools through a specific dataset of your choice.

In [5]:
# Check the versions of libraries

# TensorFlow Installation
import tensorflow
print("Tensorflow: {}".format(tensorflow.__version__))

# Keras Installation
import keras
print("Keras: {}".format(keras.__version__))

# Numpy Installation
import numpy
print("Numpy: {}".format(numpy.__version__))

# SciPy Installation
import scipy
print('Scipy: {}'.format(scipy.__version__))

# Matplotlib Installation
import matplotlib
print('Matplotlib: {}'.format(matplotlib.__version__))

# Pandas Installation
import pandas
print('Pandas: {}'.format(pandas.__version__))

# Scikit-Learn Installation
import sklearn
print('Sklearn: {}'.format(sklearn.__version__))

ModuleNotFoundError: No module named 'tensorflow'
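
Because the cell above stops at the first failing `import`, a single missing library hides the status of all the others. A possible alternative, sketched here with the standard-library `importlib.metadata` module (available from Python 3.8), looks up each installed distribution without importing it and reports the missing ones instead of raising. The mapping and the helper name `installed_versions` are assumptions made for illustration; note that the distribution name `scikit-learn` differs from the import name `sklearn`.

```python
from importlib import metadata

# Display label -> distribution name as known to pip/conda.
REQUIRED = {
    "TensorFlow": "tensorflow",
    "Keras": "keras",
    "NumPy": "numpy",
    "SciPy": "scipy",
    "Matplotlib": "matplotlib",
    "Pandas": "pandas",
    "Scikit-Learn": "scikit-learn",
}


def installed_versions(distributions):
    """Return {label: version string, or None when the distribution is missing}."""
    versions = {}
    for label, dist_name in distributions.items():
        try:
            versions[label] = metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            versions[label] = None
    return versions


report = installed_versions(REQUIRED)
for label, version in report.items():
    print(f"{label}: {version or 'NOT INSTALLED'}")
```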

## 3. Load The Data

You are free to use any datasets within an African context to provide proper data manipulation operations and procedures as a way of demonstrating functionality of your created environment.

The dataset we will use contains information about Covid-19 death cases in Africa, per country, per day from the beginning of the pandemic.

* Dataset link for download or access: https://data.humdata.org/dataset/africa-covid-19-death-cases


Open an editing session for the project, then choose the file you want to upload. 

* In Jupyter Notebook, click Upload and select the file to upload.
* Then click the blue Upload button displayed in the file's row to add the file to the project.

In [None]:
# Download and upload excel file 

# Listing files in working directory
%ls

Once a file is in the project, you can use code to read it.

### 3.1 Import Libraries

In [None]:
# Importing libraries

import pandas as pd
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
%matplotlib inline

### 3.2 Load Dataset

We are using pandas to load the data. We will also use pandas next to explore the data both with descriptive statistics and data visualization.

In [None]:
# Using Pandas


# Loading the covid deaths dataset from an excel file into a pandas DataFrame
covid_df = pd.read_excel("covid19_africa_deceased_hera.xlsx", index_col=0, header=2)
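
The `header=2` and `index_col=0` arguments are easiest to understand on a tiny example. The sketch below uses `pd.read_csv`, which accepts the same two parameters, on made-up in-memory text so that it runs without the Excel file. The layout (two title rows above the real header, an `ISO` code in the first column) is an assumption for illustration and may not match the actual spreadsheet.

```python
import io

import pandas as pd

# Hypothetical file contents: two title rows, then the real header on the third row.
raw = io.StringIO(
    "Covid-19 deaths (made-up sample)\n"
    "Source: illustrative only\n"
    "ISO,Country name,09/11/2020,10/11/2020\n"
    "AAA,Country A,5,7\n"
    "BBB,Country B,0,1\n"
)

# header=2    -> the third row (0-based index 2) supplies the column names;
#                the rows above it are discarded.
# index_col=0 -> the first column becomes the row index.
sample_df = pd.read_csv(raw, header=2, index_col=0)
print(sample_df)
```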

## 4. Summarize the Dataset

In this step we are going to look at the data in a few different ways:

1. Dimensions of the dataset.
2. Peek at the data itself.
3. Statistical summary of all attributes.
4. Frequency of each value in a selected column.

### 4.1 Dimensions of Dataset

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

In [None]:
# shape stores the number of rows and columns as a tuple (number of rows, number of columns) 
covid_df.shape 

### 4.2 Peek at the Data

In [None]:
# Listing column variable names

# returns the column labels of the given Dataframe
print(covid_df.columns) 

In [None]:
# Check first 15 rows of data
covid_df.head(15)

In [None]:
# returns a Series with the data type of each column
covid_df.dtypes 

### 4.3 Statistical Summary

This includes the count, mean, the min and max values as well as some percentiles.

In [None]:
# Descriptions of each variable

# describe() shows some basic statistical details like percentile, mean, std. of a data frame or a series of numeric values.
covid_df.describe() 

In [None]:
# Summary statistics (minimum, maximum, mean, median, percentiles)

print('min:', covid_df['10/11/2020'].min())
print('max:', covid_df['10/11/2020'].max())
print('mean:', covid_df['10/11/2020'].mean())
print('median:', covid_df['10/11/2020'].median())
print('50th percentile:', covid_df['10/11/2020'].quantile(0.5)) # 50th percentile, also known as median
print('5th percentile:', covid_df['10/11/2020'].quantile(0.05))
print('10th percentile:', covid_df['10/11/2020'].quantile(0.1))
print('95th percentile:', covid_df['10/11/2020'].quantile(0.95))

In [None]:
# value_counts() returns how many times each distinct value occurs in the column

print(covid_df['10/11/2020'].value_counts(ascending=True)) # list value counts in ascending order
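
To make the summary statistics above concrete, here is a small sketch on made-up numbers (seven hypothetical countries, not values from the dataset). It shows that the 50th percentile and the median coincide, how `quantile` interpolates between neighbouring values, and what `value_counts` returns.

```python
import pandas as pd

# Made-up death counts for seven hypothetical countries.
deaths = pd.Series([0, 0, 1, 3, 3, 3, 10], name="10/11/2020")

median = deaths.median()
q50 = deaths.quantile(0.5)       # same value as the median
q95 = deaths.quantile(0.95)      # interpolated between the two largest values
counts = deaths.value_counts()   # how often each distinct value occurs

print("median:", median, "| 50th percentile:", q50)
print("95th percentile:", q95)
print(counts)
```

With the default linear interpolation, the 95th percentile falls at position 0.95 × 6 = 5.7 in the sorted values, so it lies 70% of the way from 3 to 10.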

## 5. Data Visualization

We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

1. Histograms
2. Bar charts

### 5.1 Histograms

We can create a histogram to get an idea of the distribution.

In [None]:
# Using Matplotlib


# Plotting a histogram of the 10/11/2020 column values in the Covid-deaths data set
plt.hist(covid_df['10/11/2020'])
plt.show()
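
When no `bins` argument is given, `plt.hist` splits the range between the smallest and largest value into ten equal-width bins and counts the values in each. The sketch below reproduces that counting step with `np.histogram` on made-up numbers, which can help when reading the plot.

```python
import numpy as np

# Made-up death counts for seven hypothetical countries.
values = np.array([0, 0, 1, 3, 3, 3, 10])

# Ten equal-width bins between the minimum and the maximum value.
counts, edges = np.histogram(values, bins=10)

# Every bin includes its left edge; only the last bin also includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```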

### 5.2 Bar charts

In [None]:
# This produces a bar chart of how often each death count occurs in the 10/11/2020 column.
day = covid_df['10/11/2020'].value_counts()
fig, ax = plt.subplots()


# Using Numpy


# np.arange(len(day)) returns the integers 0 .. len(day)-1, used here as the bar positions
# bar() makes the bar plot
ax.bar(np.arange(len(day)), day)
ax.set_xlabel('Reported deaths')
ax.set_ylabel('Number of countries')
ax.set_title('Covid Deaths on 10/11/2020')

# set_xticks() on the axes will set the data points
# set_xticklabels() will set the displayed text
ax.set_xticks(np.arange(len(day)))
ax.set_xticklabels(day.index)

# display Bar chart
plt.show()

This example was adapted from https://matplotlib.org/gallery/statistics/barchart_demo.html

## 6. The Functional API - a way to build graphs of layers

The Keras functional API is a way to create models that are more flexible than those built with the `keras.Sequential` API. The functional API can handle models with non-linear topology, shared layers, and even multiple inputs or outputs.

The main idea is that a deep learning model is usually a directed acyclic graph (DAG) of layers. So the functional API is a way to build graphs of layers.

In [None]:
# Using Keras


# To build this model using the functional API, start by creating an input node
# The shape describes a single sample (one row), so it is the number of columns; the batch size is omitted.
inputs = keras.Input(shape=(covid_df.shape[1],))

In [None]:
# The `inputs` object that is returned contains information about the shape and dtype of the input data that you feed to your model
inputs.shape

In [None]:
# here's the datatype
print(inputs.dtype)

In [None]:
# You create a new node in the graph of layers by calling a layer on this inputs object
dense = layers.Dense(64, activation="relu")
x = dense(inputs)

In [None]:
# The "layer call" action is like drawing an arrow from "inputs" to this layer you created. 
# You're "passing" the inputs to the dense layer, and you get x as the output.

# Let's add a few more layers to the graph of layers
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(10)(x)

In [None]:
# At this point, you can create a Model by specifying its inputs and outputs in the graph of layers
model = keras.Model(inputs=inputs, outputs=outputs, name="sample_model")

In [None]:
# Let's check out what the model summary looks like:
model.summary()

In [None]:
# You can also plot the model as a graph:

# NB: For this to work, you must 'pip install pydot'
# and install graphviz (https://graphviz.gitlab.io/download/), e.g. 'sudo apt install graphviz'
keras.utils.plot_model(model, "sample_model.png")

A "graph of layers" is an intuitive mental image for a deep learning model, and the functional API is a way to create models that closely mirrors this.

This example was adapted from https://keras.io/guides/functional_api/
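
To illustrate the "layer call" idea without requiring TensorFlow, the following toy sketch mimics the same wiring in plain NumPy. It is not how Keras is implemented: `ToyDense` is a hypothetical stand-in, it runs eagerly on real numbers rather than on symbolic tensors, and its weights are random. It only shows how calling one layer on the output of another chains them into a graph and how the shapes propagate.

```python
import numpy as np

rng = np.random.default_rng(0)


class ToyDense:
    """A toy stand-in for a dense layer: output = activation(x @ W + b)."""

    def __init__(self, units, activation=None):
        self.units = units
        self.activation = activation
        self.weights = None  # created on first call, once the input width is known
        self.bias = None

    def __call__(self, x):
        if self.weights is None:
            self.weights = rng.normal(scale=0.1, size=(x.shape[-1], self.units))
            self.bias = np.zeros(self.units)
        out = x @ self.weights + self.bias
        if self.activation == "relu":
            out = np.maximum(out, 0.0)
        return out


# A batch of 5 samples with 4 features each (made-up numbers).
inputs = rng.normal(size=(5, 4))

# Same wiring as the cells above: inputs -> Dense(64) -> Dense(64) -> Dense(10).
x = ToyDense(64, activation="relu")(inputs)
x = ToyDense(64, activation="relu")(x)
outputs = ToyDense(10)(x)

print(inputs.shape, "->", x.shape, "->", outputs.shape)
```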

## 7. Pre-processing data into a form suitable for training

This section focuses on getting the loaded data into a form that a model can train on, and gives a quick example of preprocessing.

In [None]:
# Using Tensorflow


# Setup

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing


# For any small tabular dataset (CSV or Excel), the simplest way to train a TensorFlow model on it is to load it into memory
# as a pandas DataFrame or a NumPy array, which we have already done.

# Let's assume the nominal task for our dataset is to predict covid deaths from the other measurements, 
# so we separate the features and labels for training.
del covid_df['Country name']
covid_features = covid_df.copy()
covid_labels = covid_features.pop('09/11/2020')

In [None]:
# For this dataset we will treat all features identically. 
# Pack the features into a single NumPy array.

covid_features = np.array(covid_features)
covid_features
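
The copy-then-`pop` pattern used above can be checked on a small made-up table. The sketch below (hypothetical column values, same date-style column names) shows that `pop` removes the label column from the copy and returns it, while the original DataFrame keeps all of its columns.

```python
import pandas as pd

# Made-up table: two earlier days as features, the last day as the label.
df = pd.DataFrame({
    "07/11/2020": [1, 4, 0],
    "08/11/2020": [2, 5, 0],
    "09/11/2020": [3, 7, 1],
})

features = df.copy()                 # work on a copy so df keeps all columns
labels = features.pop("09/11/2020")  # removes the column from features and returns it

features_array = features.to_numpy()  # one row per sample, one column per feature
print(features_array.shape, labels.shape)
```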

In [None]:
# Next, build a regression model to predict the covid deaths.
# Since there is only a single input tensor, a keras.Sequential model is sufficient here.

covid_model = tf.keras.Sequential([
  layers.Dense(64),
  layers.Dense(1)
])

covid_model.compile(loss = tf.losses.MeanSquaredError(),
                      optimizer = tf.optimizers.Adam())

In [None]:
# To train that model, pass the features and labels to Model.fit

covid_model.fit(covid_features, covid_labels, epochs=10)

We have just seen the most basic way to train a model on tabular data loaded with pandas!

##### SciPy

SciPy is a free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.

In [None]:
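# scipy.datasets (added in SciPy 1.10) provides face(); scipy.misc, its older home, is deprecated
try:
    from scipy.datasets import face  # needs the optional `pooch` package to download the image
except ImportError:
    from scipy.misc import face      # older SciPy releases

plt.imshow(face())
plt.show()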
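
Two of the modules named in the description, integration and optimization, can be demonstrated in a few lines. This is a minimal sketch on textbook functions chosen for illustration, not something taken from the dataset.

```python
from scipy import integrate, optimize

# Integration: area under x^2 between 0 and 1 (the exact value is 1/3).
area, abs_error = integrate.quad(lambda x: x ** 2, 0.0, 1.0)

# Optimization: the minimum of (x - 2)^2 lies at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)

print("integral:", area)
print("minimizer:", result.x)
```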

##### Scikit-Learn

Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

## 8. Simple linear regression

In [None]:
# Using sklearn


# We are assuming the variable '09/11/2020' is the target that the model predicts. 
# All other variables are used as predictors, also called features.

# Setup

from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
import statsmodels.formula.api as sm

normalize = preprocessing.Normalization()
normalize.adapt(covid_features)

# Define features as X, target as y.
# X is kept as a DataFrame so the column names remain available for labelling the coefficients.
X = covid_df.drop(columns=['09/11/2020'])
y = covid_labels

In [None]:
# Splitting the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

A linear regression consists of a coefficient for each feature and one intercept.

To make a prediction, each feature is multiplied by its coefficient. The intercept and all of these products are added together. This sum is the predicted value of the target variable.

The residual sum of squares (RSS) measures how far the predictions are from the actual values of the target variable: it is the sum, over all records, of the squared difference between the predicted and the actual value.

The function fit calculates the coefficients and intercept that minimize the RSS when the regression is used on each record in the training set.
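
The description above can be checked numerically. The following sketch uses made-up data that follows a known linear rule exactly, fits scikit-learn's `LinearRegression`, rebuilds the prediction by hand from the intercept and coefficients, and computes the RSS. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 3 + 2*x1 - 1*x2 exactly.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [4.0, 1.0], [5.0, 5.0]])
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

model = LinearRegression().fit(X, y)

# Prediction by hand: intercept plus each feature times its coefficient.
manual_prediction = model.intercept_ + X @ model.coef_

# Residual sum of squares on the training data.
rss = np.sum((y - manual_prediction) ** 2)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("RSS:", rss)
```

Because the data contain no noise, the fitted coefficients should recover the rule used to generate `y` and the RSS should be zero up to floating-point error; on real data the RSS is positive and `fit` returns the coefficients that make it as small as possible.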

In [6]:
# Fitting Simple Linear Regression to the Training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# The intercept
print('Intercept: \n', regressor.intercept_)

# The coefficients
print('Coefficients: \n', pd.Series(regressor.coef_, index=X.columns, name='coefficients'))

NameError: name 'LinearRegression' is not defined