# AI Environment Creation and Testing

## Instructions

This is a group assignment that showcases your technical creativity in building a working environment for AI experiments. Create the environment on a single node with the libraries and tools listed below. The node is your own machine, which simulates an on-premise hardware environment. The environment must run on Linux (any distribution is acceptable; those using Windows are expected to virtualize a Linux installation):

1. Latest Anaconda installation.
2. Latest version of Python
3. Jupyter Notebook
4. TensorFlow
5. Keras
6. NumPy
7. SciPy
8. Matplotlib
9. Pandas
10. Scikit-Learn
11. Other.

### After creating your environment, you should be able to provide:

1. A clear demonstration and explanation of the above mentioned libraries and tools through a specific dataset of your choice. You are free to use any datasets within an African context to provide proper data manipulation operations and procedures as a way of demonstrating functionality of your created environment.

2. Necessary comments on your Notebook.

3. Documentation of your work.

## Requirement 1

This Group assignment will simulate an on-premise hardware environment which will be your machine. 
This environment should be running specifically on Linux platform.

### 1.1 Linux platform checklist

In [1]:
# Linux Installation

!lsb_release -d

Description:	Ubuntu 18.04.5 LTS


## Requirement 2

The AI implementation environment is to be created on a single node with the libraries and tools listed below.

### 1.2 Tools and libraries checklist

In [2]:
# Latest Anaconda Installation

!conda list anaconda$

/bin/bash: conda: command not found


In [3]:
# Latest Version of Python

!python3 -V

Python 3.8.0


In [4]:
# Jupyter Notebook Installation

!jupyter --version

jupyter core     : 4.7.0
jupyter-notebook : 6.2.0
qtconsole        : 5.0.1
ipython          : 7.16.1
ipykernel        : 5.4.3
jupyter client   : 6.1.11
jupyter lab      : not installed
nbconvert        : 6.0.7
ipywidgets       : 7.6.3
nbformat         : 5.1.2
traitlets        : 4.3.3


In [5]:
# TensorFlow Installation

!pip3 show tensorflow

In [6]:
# Keras Installation

!pip3 show keras

In [7]:
# NumPy Installation

!pip3 show numpy

In [8]:
# SciPy Installation

!pip3 show scipy

In [9]:
# Matplotlib Installation

!pip3 show matplotlib

In [10]:
# Pandas Installation

!pip3 show pandas

In [11]:
# Scikit-Learn Installation

!pip3 show scikit-learn
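
The `pip3 show` cells above have no recorded output in this notebook, so on their own they do not confirm that the packages are installed. As a hedged alternative, the sketch below asks the running kernel directly through `importlib.metadata` from the standard library (Python 3.8+). The helper name `installed_versions` and the `REQUIRED` list of PyPI distribution names are illustrative assumptions, not part of the assignment.

```python
from importlib import metadata

# Distribution names as published on PyPI (an assumption for this sketch).
REQUIRED = ["tensorflow", "keras", "numpy", "scipy",
            "matplotlib", "pandas", "scikit-learn"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report what the interpreter running this notebook can actually see.
for name, version in installed_versions(REQUIRED).items():
    print(f"{name:<14}{version or 'NOT INSTALLED'}")
```

Because the check runs inside the kernel, it reports what `import` will find, even when `pip3` on the shell path belongs to a different Python installation.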

## Requirement 3

After creating your environment, you should be able to provide:

* A clear demonstration and explanation of the above mentioned libraries and tools through a specific dataset of your choice.

### 1.3 Data manipulation operations and procedures

Provide proper data manipulation operations and procedures as a way of demonstrating functionality of your created environment.


#### 1.3.1 Uploading files to our project

Open an editing session for the project, then choose the file you want to upload. 

* In Jupyter Notebook, click Upload and select the file. Then click the blue Upload button in the file’s row to add the file to the project.

In [12]:
# Download and upload excel file 

# Listing files in working directory
%ls

'Assignment #1.ipynb'  'Assignment #1 (Updated).ipynb'   README.md


#### 1.3.2 Loading Data into our project

Once a file is in the project, you can use code to read it.

You are free to use any datasets within an African context to provide proper data manipulation operations and procedures as a way of demonstrating functionality of your created environment.

The dataset we will use contains information about Covid-19 death cases in Africa, per country, per day from the beginning of the pandemic.

* Dataset link for download or access: https://data.humdata.org/dataset/africa-covid-19-death-cases

In [13]:
# Loading the covid deaths dataset from an excel file into a pandas DataFrame

import pandas as pd
coviddf = pd.read_excel("covid19_africa_deceased_hera.xlsx", index_col=0, header=2)
coviddf

ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.
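
The `ImportError` above means pandas could not find a library to parse the workbook. One way around it, offered as a sketch rather than the original authors' fix, is to install an Excel reader (for example `pip3 install openpyxl`) and pass it explicitly through the `engine` parameter of `pd.read_excel`. Note that xlrd 2.0 and later read only legacy `.xls` files, so openpyxl is the safer choice for an `.xlsx` file. The helper names `pick_excel_engine` and `load_covid_sheet` are hypothetical.

```python
from importlib.util import find_spec


def pick_excel_engine(path):
    """Return an installed pandas Excel engine suited to the file type, or None."""
    # Legacy binary workbooks need xlrd; .xlsx workbooks need openpyxl.
    engine = "xlrd" if path.lower().endswith(".xls") else "openpyxl"
    return engine if find_spec(engine) is not None else None


def load_covid_sheet(path):
    """Read the workbook with the same options as the cell above."""
    import pandas as pd  # imported here so the engine check works without pandas

    engine = pick_excel_engine(path)
    if engine is None:
        raise ImportError("No Excel reader found; try: pip3 install openpyxl")
    return pd.read_excel(path, index_col=0, header=2, engine=engine)


# Shows which engine would be used, or None if the reader is not installed.
print(pick_excel_engine("covid19_africa_deceased_hera.xlsx"))
```

Once a reader is installed, `coviddf = load_covid_sheet("covid19_africa_deceased_hera.xlsx")` would stand in for the failing `pd.read_excel` call.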

In [None]:
# Check first 15 rows of data
coviddf.head(15)

#### 1.3.3 Accessing General Information about the Dataset

Get to know the dataset by inspecting its structure, shape, summary statistics, and column types.

In [None]:
coviddf.info()

In [None]:
coviddf.shape

In [None]:
# Descriptions of each variable

coviddf.describe()

In [None]:
coviddf.dtypes

#### 1.3.4 Exploring project data

With Anaconda Enterprise, you can explore project data using visualization libraries such as Matplotlib, and numeric libraries such as NumPy, SciPy, and Pandas.

### Let's demonstrate use of these libraries below:

#### 1. Start by importing libraries, and reading data into a Pandas DataFrame

##### Pandas

In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. 

In [None]:
# import pandas as pd
# coviddf = pd.read_excel("covid19_africa_deceased_hera.xlsx", index_col=0, header=2)

#### 2. Listing column / variable names

In [None]:
print(coviddf.columns)

#### 3. Summary statistics (minimum, maximum, mean, median, percentiles)

In [None]:
print('length:', len(coviddf)) # length of data set
print('shape:', coviddf.shape) # length and width of data set
print('size:', coviddf.size) # length * width
print('min:', coviddf['10/11/2020'].min())
print('max:', coviddf['10/11/2020'].max())
print('mean:', coviddf['10/11/2020'].mean())
print('median:', coviddf['10/11/2020'].median())
print('50th percentile:', coviddf['10/11/2020'].quantile(0.5)) # 50th percentile, also known as median
print('5th percentile:', coviddf['10/11/2020'].quantile(0.05))
print('10th percentile:', coviddf['10/11/2020'].quantile(0.1))
print('95th percentile:', coviddf['10/11/2020'].quantile(0.95))
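
By default `Series.quantile` interpolates linearly between the two nearest sorted values, which is why the 50th percentile matches the median. The pure-Python sketch below reproduces that rule on a made-up list of numbers (not taken from the dataset); `linear_quantile` is a hypothetical helper written only for illustration.

```python
import math


def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)      # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower              # how far to move from lower toward upper
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


sample = [8, 0, 4, 2, 6]                    # illustrative values only
print("50th percentile:", linear_quantile(sample, 0.5))
print("5th percentile:", linear_quantile(sample, 0.05))
print("95th percentile:", linear_quantile(sample, 0.95))
```

With few rows, the extreme percentiles sit between the smallest (or largest) two observations, so they are sensitive to individual values.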

#### 4. Using value_counts function to show the number of items.

In [None]:
print(coviddf['10/11/2020'].value_counts())
print()
print(coviddf['10/11/2020'].value_counts(ascending=True))

#### 5. Time series data visualization

##### NumPy

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

##### Matplotlib

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+.

In [None]:
# Creating four series of random numbers over time,
# calculating the cumulative sums for each series over time,
# and plotting them

# Importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

# Use a separate variable so the Covid-19 DataFrame (coviddf) is not overwritten
randomdf = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2015', periods=1000), columns=list('ABCD'))
randomdf = randomdf.cumsum()
randomdf.plot()

This example was adapted from http://pandas.pydata.org/pandas-docs/stable/visualization.html

#### 6. Histograms

In [None]:
# Plotting a histogram of the 10/11/2020 column values in the Covid-deaths data set
plt.hist(coviddf['10/11/2020'])
plt.show()

#### 7. Bar charts

In [None]:
# The following sample code produces a bar chart of how often each death count occurs in the 10/11/2020 column.

day = coviddf['10/11/2020'].value_counts()

fig, ax = plt.subplots()

ax.bar(np.arange(len(day)), day)

ax.set_xlabel('covid deaths on 10/11/2020')
ax.set_ylabel('number of countries')
ax.set_title('Frequency of Covid Death Counts on 10/11/2020')
ax.set_xticks(np.arange(len(day)))
ax.set_xticklabels(day.index)

plt.show()

This example was adapted from https://matplotlib.org/gallery/statistics/barchart_demo.html

#### 1.3.5 Using Statistics

Anaconda Enterprise supports statistical work using the R language and Python libraries such as NumPy, SciPy, Pandas, Statsmodels, and scikit-learn.

The following Jupyter notebook Python example shows how to use these libraries to fit and evaluate a linear regression.

These examples also include plots produced with the libraries seaborn and Matplotlib.

#### 1. Start by importing necessary libraries and functions

##### Scikit-Learn

Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

In [None]:
import pandas as pd
import seaborn as sns
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import statsmodels.formula.api as sm

%matplotlib inline

#### 2. Linear regression

In [None]:
# The 10/11/2020 column is the target that the model predicts. All other columns are used as predictors, also called features.

# The target variable is continuous, so use a linear regression instead of a logistic regression.

# Define features as X, target as y.
X = coviddf.drop('10/11/2020', axis='columns')
y = coviddf['10/11/2020']

In [None]:
# Splitting the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

A linear regression consists of a coefficient for each feature and one intercept.

To make a prediction, each feature is multiplied by its coefficient. The intercept and all of these products are added together. This sum is the predicted value of the target variable.

The residual sum of squares (RSS) is calculated to measure the difference between the prediction and the actual value of the target variable.

The function fit calculates the coefficients and intercept that minimize the RSS when the regression is used on each record in the training set.
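
The description above can be checked on a toy example. The sketch below assumes only NumPy and uses made-up numbers rather than the Covid-19 data. It builds a target from known coefficients, recovers them with `np.linalg.lstsq` (a least-squares solver, standing in here for `LinearRegression.fit`), and computes the RSS by hand.

```python
import numpy as np

# Made-up data: two features, and a target built as y = 1 + 2*x1 + 3*x2.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Append a column of ones so the solver also estimates the intercept.
design = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(design, y, rcond=None)
coefficients, intercept = solution[:-1], solution[-1]

# Prediction: each feature times its coefficient, plus the intercept.
y_hat = X @ coefficients + intercept

# Residual sum of squares: the quantity the fit minimizes.
rss = float(np.sum((y - y_hat) ** 2))

print("coefficients:", coefficients)
print("intercept:", intercept)
print("RSS:", rss)
```

Because the toy target is an exact linear function of the features, the recovered coefficients match the ones used to build it and the RSS is zero up to rounding. Real data leaves a positive RSS.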

In [None]:
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# The intercept
print('Intercept: \n', regressor.intercept_)

# The coefficients
print('Coefficients: \n', pd.Series(regressor.coef_, index=X.columns, name='coefficients'))

Now check the accuracy when this linear regression is used on new data that it was not trained on. That new data is the test set.

In [None]:
# Predicting the Test set results
y_pred = regressor.predict(X_test)

# Visualising the Test set results
# code adapted from https://joomik.github.io/Housing/
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, color='green')
ax.set(
    xlabel=r"Deaths: $Y_i$",
    ylabel=r"Predicted deaths: $\hat{Y}_i$",
    title=r"Deaths vs predicted deaths: $Y_i$ vs $\hat{Y}_i$",
)
plt.show()

In this scatter plot, the closer the points lie to the diagonal line where predicted and actual values are equal, the better the regression predicts the data in the test set.

The mean squared error quantifies this performance:

In [None]:
# The mean squared error as a way to measure model performance.
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
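
The mean squared error is the RSS divided by the number of test records. A minimal pure-Python sketch of the same quantity, using made-up numbers and a hypothetical helper name, could look like this:

```python
def mean_squared_error_by_hand(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Illustrative values only, not results from the model above.
print("Mean squared error: %.2f" % mean_squared_error_by_hand([3, 5, 2], [1, 5, 4]))
```

Because the errors are squared, the result is in squared units of the target, and a lower value means predictions closer to the observed values.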

##### TensorFlow

TensorFlow is a free and open-source software library for machine learning. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks. TensorFlow is a symbolic math library based on dataflow and differentiable programming.

##### Keras

Keras is an open-source software library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library. 

##### SciPy

SciPy is a free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.