# Tutorial 1 (Introduction to AI)

# Python and Data

In the "Introduction to AI" module, we will use **Python** as our programming language because it has an intuitive syntax, straightforward control flow, and convenient built-in data structures. Python is also interpreted, so code can be run directly without a separate compilation step. This makes Python especially useful for prototyping AI algorithms.


A very large number of libraries are available for Python, and many of them are aimed at Artificial Intelligence. The libraries we will use include Scikit-learn, TensorFlow, and Keras (a high-level neural network library that runs on top of TensorFlow).

In this module we will use the following software:
    
* **Python** - The programming language.
* **Pandas** - Allows for data preprocessing.  Tutorials [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html)
* **Scikit-Learn** - Machine learning framework for Python.  Tutorial [here](http://scikit-learn.org/stable/tutorial/basic/tutorial.html).

and:

* **Keras** - [Keras](https://keras.io) is a high-level neural networks API, written in Python and capable of running on top of TensorFlow.
* **TensorFlow** - Google's deep learning framework. You must install the version specified below.



## Code and Development in Python

Tutorials for this module will be presented as **Jupyter Notebooks**.  For your own development work you should use the **Spyder** IDE.  Both of these come with the **Anaconda** distribution and you should install this on your own machines.

# Installing Anaconda

Go to the download page for Anaconda, https://www.anaconda.com/distribution/ and follow the instructions.

After installation, Jupyter Notebooks can be launched from the menu, or by the command in the console:

`jupyter notebook`

You might need to specify a folder on launch.  For example:

`jupyter notebook --notebook-dir D:/my_works/`

The page should open in the browser in a few seconds. If it does not, go to http://localhost:8888/, where the local Jupyter Notebook server usually runs.

The following packages are needed for this module (you might find that many are already installed):

```
conda install scipy
pip install --upgrade scikit-learn
pip install --upgrade pandas
pip install --upgrade pandas-datareader
pip install --upgrade matplotlib
pip install --upgrade pillow
pip install --upgrade requests
pip install --upgrade h5py
pip install --upgrade tensorflow==2.6.0
```


## Working with Spyder

To create a new file, select New File from the File menu, and rename as you wish.

- Spyder comes with a helpful debugging tool that allows you to step through your code, inspect variables, and so on.
- Output is written to the console, and this might well be graphics and diagrams, as well as print statements.


##  Working with Jupyter Notebook

Jupyter Notebooks are a good way to present work with comments, and the tutorials for this module will be presented in this way.

To create your own, click New -> Python 3.

The notebook consists of cells which are text (Markdown) and code (Code). If run, the code cells of a notebook make up a single program.  You can select a cell type on the control panel.

Working with cells:
<ul>
    <li> Select a cell - click on it</li>
    <li> Editing - click on it twice</li>
    <li> Running the cell - `SHIFT+ENTER` or click on the button <button class='fa fa-play icon-play btn btn-xs btn-default'></button> on the panel</li>
    <li> Adding a new cell - click on the button <button class='fa fa-plus icon-plus btn btn-xs btn-default'></button> on the panel</li>
    <li> Deleting a cell - click on the button <button class='fa fa-cut icon-cut btn btn-xs btn-default'></button> on the panel</li>
    <li> Moving a cell - click on the vertical arrows</li>
</ul>

There are two commonly used versions of Python: **Python 2** and **Python 3**. The two are similar, but they **are not compatible**: programs written in one version may not work in the other.

In the **"Introduction to AI"** module we are using **Python 3**.

The exact Python version is not critical, but it should be >= 3.6 because the code in these tutorials uses f-strings. Version 3.8 is recommended.

**Note:** You can check your Python version at the command line by running `python --version`.

If you use any of the Linux distributions then Python is probably already installed. Try the following commands in the terminal to start interactive mode:

`python` or `python3` or `python2`

Exit: `Ctrl+D`

To run the code stored in a file such as `main.py`, pass the file name to the interpreter:

`python main.py`

Help: **`help(X)`**, where `X` is the object you need help with

Help exit: `q`.

In [None]:
# What version of Python do you have?
import sys

import keras
import pandas as pd
import sklearn as sk
import tensorflow as tf

print(f"Tensor Flow Version: {tf.__version__}")
print(f"Keras Version: {keras.__version__}")
#print()
print(f"Python {sys.version}")
print(f"Pandas {pd.__version__}")
print(f"Scikit-Learn {sk.__version__}")

Using TensorFlow backend.


Tensor Flow Version: 2.3.0
Keras Version: 2.3.1
Python 3.7.3 (default, Mar 27 2019, 16:54:48) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
Pandas 0.24.2
Scikit-Learn 0.20.3


# What is the "Introduction to AI" module about?

- The "Introduction to AI" module introduces artificial intelligence, machine learning and neural computing as both technical subjects and as fields of intellectual activity.


- We introduce machine learning as an alternative knowledge acquisition/representation paradigm, explain its basic principles, and describe a range of techniques, including neural computing, together with their application areas.


- A range of machine learning techniques will be covered, leading up to neural networks and deep learning, a popular type of machine learning that is based upon the original neural networks popularized in the 1980s.  A deep neural network is nothing more than a neural network with many layers.  While we've always been able to create/calculate deep neural networks, we've lacked an effective means of training them.  Deep learning provides an efficient means to train deep neural networks.

If deep learning is a type of machine learning, this raises the question, "What is machine learning?"  The following diagram illustrates how machine learning differs from traditional software development.

![class_1_ml_vs_trad.png](attachment:class_1_ml_vs_trad.png)

* **Traditional Software Development** - Programmers create programs that specify how to transform input into the desired output.
* **Machine Learning** - Programmers create models that can learn to produce the desired output for given input. This learning fills the traditional role of the computer program.
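The contrast between the two approaches can be sketched with a toy example. The snippet below is only an illustration under stated assumptions: the temperature-conversion task, the sample values, and the use of `np.polyfit` as the "learning" step are all chosen for this sketch and are not part of the module material.

```python
import numpy as np

# Traditional software: the programmer writes the rule explicitly.
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

# Machine learning: the rule is estimated from example input/output pairs.
# (Illustrative sample values.)
celsius = np.array([-10.0, 0.0, 10.0, 20.0, 30.0])
fahrenheit = np.array([14.0, 32.0, 50.0, 68.0, 86.0])

# Fit a straight line (degree-1 polynomial) to the examples.
slope, intercept = np.polyfit(celsius, fahrenheit, 1)

print(f"Hand-written rule: 25 C = {celsius_to_fahrenheit(25)} F")
print(f"Learned rule:      25 C = {slope * 25 + intercept:.1f} F")
```

In the second half nobody typed the conversion formula; the slope and intercept were recovered from the data, which is the role "learning" plays in place of a hand-written program.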

### Python basics

#### 1. Count to 10

Use a `for` loop and a `range`.

In [None]:
# If you ever see xrange, you are in Python 2
for x in range(1, 11):  # range stops before its end value, so use 11 to reach 10
    print(x)  # If you ever see print x (no parentheses), you are in Python 2

1
2
3
4
5
6
7
8
9
10


#### 2. Printing Numbers and Strings

In [None]:
acc = 0
for x in range(1, 10):
    acc += x
    print(f"Adding {x}, sum so far is {acc}")

print(f"Final sum: {acc}")

Adding 1, sum so far is 1
Adding 2, sum so far is 3
Adding 3, sum so far is 6
Adding 4, sum so far is 10
Adding 5, sum so far is 15
Adding 6, sum so far is 21
Adding 7, sum so far is 28
Adding 8, sum so far is 36
Adding 9, sum so far is 45
Final sum: 45


#### 3.  Lists and Sets

In [None]:
c = ['a', 'b', 'c', 'd']
print(c)

['a', 'b', 'c', 'd']


In [None]:
# Iterate over a collection.
for s in c:
    print(s)

a
b
c
d


In [None]:
# Iterate over a collection and keep track of the index.  (Python is zero-based!)
# Use a separate loop variable (s) so that the list c is not overwritten.
for i, s in enumerate(c):
    print(f"{i}:{s}")

0:a
1:b
2:c
3:d


In [None]:
# Manually add items, lists allow duplicates
c = []
c.append('a')
c.append('b')
c.append('c')
c.append('c')
print(c)

['a', 'b', 'c', 'c']


In [None]:
# Manually add items, sets do not allow duplicates
# Sets add, lists append.
c = set()
c.add('a')
c.add('b')
c.add('c')
c.add('c')
print(c)

{'b', 'c', 'a'}
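Note that the set printed above is not in insertion order, since sets are unordered. A minimal sketch of some common set operations follows; the variable names `letters` and `unique` are chosen for this illustration only.

```python
letters = ['a', 'b', 'c', 'c']

unique = set(letters)      # building a set from a list removes duplicates
print(len(unique))         # number of distinct items
print('c' in unique)       # membership test
print(sorted(unique))      # sorted() returns a list in a predictable order
```

Using `sorted()` is one way to obtain reproducible output when the order of a set matters for display.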


In [None]:
# Insert
c = ['a', 'b', 'c']
c.insert(0, 'a0')
print(c)
# Remove
c.remove('b')
print(c)
# Remove at index
del c[0]
print(c)

['a0', 'a', 'b', 'c']
['a0', 'a', 'c']
['a', 'c']


#### 4.  Maps/Dictionaries/Hash Tables

In [None]:
d = {'name': "Richard", 'address':"7 Build"}
print(d)
print(d['name'])

if 'name' in d:
    print("Name is defined")

if 'age' in d:
    print("age defined")
else:
    print("age undefined")

{'name': 'Richard', 'address': '7 Build'}
Richard
Name is defined
age undefined


In [None]:
d = {'name': "Richard", 'address':"7 Build"}
# All of the keys
print(f"Key: {d.keys()}")

# All of the values
print(f"Values: {d.values()}")

Key: dict_keys(['name', 'address'])
Values: dict_values(['Richard', '7 Build'])


In [None]:
# Python list & map structures
customers = [
    {'name': 'Richard & Lesley Jonson', 'pets': ['Tor', 'Oscar', 'Coco']},
    {'name': 'Anna McCartney', 'pets': ['sam']},
    {'name': 'Alexa Mason'}
]

print(customers)

for customer in customers:
    print(f"{customer['name']}:{customer.get('pets', 'no pets')}")

[{'name': 'Richard & Lesley Jonson', 'pets': ['Tor', 'Oscar', 'Coco']}, {'name': 'Anna McCartney', 'pets': ['sam']}, {'name': 'Alexa Mason'}]
Richard & Lesley Jonson:['Tor', 'Oscar', 'Coco']
Anna McCartney:['sam']
Alexa Mason:no pets
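The dictionary operations used above can be combined with iteration over key/value pairs. The following is a small sketch; the `age` value added at the end is a made-up example and not part of the original data.

```python
d = {'name': "Richard", 'address': "7 Build"}

# items() yields (key, value) pairs.
for key, value in d.items():
    print(f"{key} -> {value}")

# get() returns a default instead of raising KeyError for a missing key.
age = d.get('age', 'unknown')
print(f"age: {age}")

# Assigning to a new key adds it to the dictionary (hypothetical value).
d['age'] = 42
print(d)
```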


# Python for AI

Pandas
======
[Pandas](http://pandas.pydata.org/) is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.  It is based on the [dataframe](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) concept.  For this module, Pandas will be the primary means by which data is manipulated.

The dataframe is a key component of Pandas.  We will use it to access the [auto-mpg dataset](https://archive.ics.uci.edu/ml/datasets/Auto+MPG).  This dataset can be found on the UCI machine learning repository.  For this module we will use a version of the Auto MPG dataset where column headers were added.

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.  It contains data for 398 cars, including [mpg](https://en.wikipedia.org/wiki/Fuel_economy_in_automobiles), [cylinders](https://en.wikipedia.org/wiki/Cylinder_(engine)), [displacement](https://en.wikipedia.org/wiki/Engine_displacement), [horsepower](https://en.wikipedia.org/wiki/Horsepower), weight, acceleration, model year, origin and the car's name.

The following code loads the MPG dataset into a dataframe:

In [None]:
import os
import pandas as pd

path = "."  #absolute or relative path to the folder containing the file.
            #"." for current folder

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read)
print(df[0:5])

    mpg  cylinders  displacement horsepower  weight  acceleration  year  \
0  18.0          8         307.0        130    3504          12.0    70   
1  15.0          8         350.0        165    3693          11.5    70   
2  18.0          8         318.0        150    3436          11.0    70   
3  16.0          8         304.0        150    3433          12.0    70   
4  17.0          8         302.0        140    3449          10.5    70   

   origin                       name  
0       1  chevrolet chevelle malibu  
1       1          buick skylark 320  
2       1         plymouth satellite  
3       1              amc rebel sst  
4       1                ford torino  
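In the output above, horsepower is printed without decimals while the other measurements are floats. This is consistent with the column having been read as text, because the file marks missing values with `?`. The sketch below shows the effect of `na_values` on a small inline sample; the sample rows and car names are illustrative stand-ins and are not read from `auto-mpg.csv`.

```python
import io
import pandas as pd

# A small inline sample in the same layout as the dataset (illustrative values).
csv_text = """mpg,horsepower,name
18.0,130,car a
25.0,?,car b
16.0,150,car c
"""

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), na_values=['NA', '?'])

print(raw.dtypes)     # horsepower is read as text because of the '?'
print(clean.dtypes)   # horsepower is numeric, with NaN where '?' was
print(clean.shape)    # (rows, columns)
```

This is why the later cells in this tutorial pass `na_values=['NA', '?']` when loading the file.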


In [None]:
# Perform basic statistics (mean, variance, standard deviation) on a dataframe.

import os
import pandas as pd

path = "."

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])

# Strip non-numerics ('number' selects every integer and float dtype on all platforms)
df = df.select_dtypes(include='number')

headers = list(df.columns.values)
fields = []

for field in headers:
    fields.append({
        'name' : field,
        'mean': df[field].mean(),
        'var': df[field].var(),
        'sdev': df[field].std()
    })

for field in fields:
    print(field)

{'name': 'mpg', 'mean': 23.514572864321615, 'var': 61.089610774274405, 'sdev': 7.815984312565782}
{'name': 'cylinders', 'mean': 5.454773869346734, 'var': 2.8934154399199943, 'sdev': 1.7010042445332094}
{'name': 'displacement', 'mean': 193.42587939698493, 'var': 10872.199152247364, 'sdev': 104.26983817119581}
{'name': 'horsepower', 'mean': 104.46938775510205, 'var': 1481.5693929745862, 'sdev': 38.49115993282855}
{'name': 'weight', 'mean': 2970.424623115578, 'var': 717140.9905256768, 'sdev': 846.8417741973271}
{'name': 'acceleration', 'mean': 15.568090452261291, 'var': 7.604848233611381, 'sdev': 2.7576889298126757}
{'name': 'year', 'mean': 76.01005025125629, 'var': 13.672442818627143, 'sdev': 3.697626646732623}
{'name': 'origin', 'mean': 1.5728643216080402, 'var': 0.6432920268850575, 'sdev': 0.8020548777266163}
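The same statistics can also be obtained without an explicit loop. The following is a minimal sketch using `DataFrame.agg` on a tiny hand-made dataframe; the column values are illustrative and the dataframe is an assumption standing in for the MPG data.

```python
import pandas as pd

# Tiny stand-in dataframe (illustrative values).
df = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0, 16.0],
    'cylinders': [8, 8, 8, 8],
    'name': ['a', 'b', 'c', 'd'],
})

# Keep numeric columns only, then compute several statistics at once.
numeric = df.select_dtypes(include='number')
stats = numeric.agg(['mean', 'var', 'std'])
print(stats)
```

The result is itself a dataframe with one row per statistic and one column per field, so a single value can be looked up with, for example, `stats.loc['mean', 'mpg']`.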


### Sorting and Shuffling Data frames

In [None]:
import os
import pandas as pd
import numpy as np

path = "."

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
#np.random.seed(42) # Uncomment this line to get the same shuffle each time
                    # (good for replication of results)
df = df.reindex(np.random.permutation(df.index))
df.reset_index(inplace=True, drop=True)
df

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,29.0,4,90.0,70.0,1937,14.0,75,2,volkswagen rabbit
1,14.5,8,351.0,152.0,4215,12.8,76,1,ford gran torino
2,31.9,4,89.0,71.0,1925,14.0,79,2,vw rabbit custom
3,22.0,6,146.0,97.0,2815,14.5,77,3,datsun 810
4,26.6,8,350.0,105.0,3725,19.0,81,1,oldsmobile cutlass ls
5,25.4,6,168.0,116.0,2900,12.6,81,3,toyota cressida
6,24.0,4,120.0,97.0,2489,15.0,74,3,honda civic
7,37.2,4,86.0,65.0,2019,16.4,80,3,datsun 310
8,27.0,4,97.0,88.0,2130,14.5,71,3,datsun pl510
9,21.0,6,199.0,90.0,2648,15.0,70,1,amc gremlin
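As an alternative to `reindex` with a NumPy permutation, a dataframe can be shuffled with `DataFrame.sample`. This is a sketch on a small made-up dataframe; the `random_state` value is an arbitrary choice that makes the shuffle repeatable.

```python
import pandas as pd

# Small stand-in dataframe (illustrative values).
df = pd.DataFrame({'mpg': [18.0, 15.0, 18.0, 16.0, 17.0],
                   'name': ['a', 'b', 'c', 'd', 'e']})

# frac=1 samples 100% of the rows in random order;
# random_state fixes the seed so the result can be reproduced.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```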


In [None]:
import os
import pandas as pd

path = "."

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
df = df.sort_values(by='name', ascending=True)
print(f"The first car is: {df['name'].iloc[0]}")
print(df[0:5])

The first car is: amc ambassador brougham
      mpg  cylinders  displacement  horsepower  weight  acceleration  year  \
96   13.0          8         360.0       175.0    3821          11.0    73   
9    15.0          8         390.0       190.0    3850           8.5    70   
66   17.0          8         304.0       150.0    3672          11.5    72   
315  24.3          4         151.0        90.0    3003          20.1    80   
257  19.4          6         232.0        90.0    3210          17.2    78   

     origin                     name  
96        1  amc ambassador brougham  
9         1       amc ambassador dpl  
66        1       amc ambassador sst  
315       1              amc concord  
257       1              amc concord  
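Sorting is not limited to a single column. The sketch below sorts a small made-up dataframe by cylinders (ascending) and then by mpg (descending) within each cylinder count; the values are illustrative.

```python
import pandas as pd

# Small stand-in dataframe (illustrative values).
df = pd.DataFrame({'cylinders': [8, 4, 8, 4],
                   'mpg': [18.0, 29.0, 15.0, 31.9],
                   'name': ['a', 'b', 'c', 'd']})

# One ascending flag per sort column.
result = df.sort_values(by=['cylinders', 'mpg'], ascending=[True, False])
print(result)
```

As in the cell above, the original row labels are kept after sorting; call `reset_index(drop=True)` if consecutive labels are wanted.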


### Saving a Data frame

In [None]:
import os
import pandas as pd
import numpy as np

path = "."

filename_read = os.path.join(path, "auto-mpg.csv")
filename_write = os.path.join(path, "auto-mpg-shuffle.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
df = df.reindex(np.random.permutation(df.index))
df.to_csv(filename_write, index=False) # Specify index=False to not write row numbers
print("Done")

Done


### Dropping Fields

Some fields are of no value to the model (say, neural network) and can be dropped.  The following code removes the name column from the MPG dataset.

In [None]:
import os
import pandas as pd

path = "."

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])

print(f"Before drop: {df.columns}")
df.drop('name', axis=1, inplace=True)  # axis=1 means "drop a column"
print(f"After drop: {df.columns}")

Before drop: Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin', 'name'],
      dtype='object')
After drop: Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin'],
      dtype='object')


### Calculated Fields

It is possible to add new fields to the dataframe that are calculated from the other fields.  We can create a new column that gives the weight in kilograms.  The equation to calculate a metric weight, given a weight in pounds is:

$ m_{(kg)} = m_{(lb)} \times 0.45359237 $

This can be used with the following Python code:

In [None]:
import os
import pandas as pd

path = "."

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
df.insert(1, 'weight_kg', (df['weight'] * 0.45359237).astype(int))
df

Unnamed: 0,mpg,weight_kg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,1589,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,1675,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,1558,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,1557,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,1564,8,302.0,140.0,3449,10.5,70,1,ford torino
5,15.0,1969,8,429.0,198.0,4341,10.0,70,1,ford galaxie 500
6,14.0,1974,8,454.0,220.0,4354,9.0,70,1,chevrolet impala
7,14.0,1955,8,440.0,215.0,4312,8.5,70,1,plymouth fury iii
8,14.0,2007,8,455.0,225.0,4425,10.0,70,1,pontiac catalina
9,15.0,1746,8,390.0,190.0,3850,8.5,70,1,amc ambassador dpl
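Note that `.astype(int)` truncates the fractional part rather than rounding. The sketch below compares the two on two of the weights shown above (3504 lb and 3436 lb); the constant name `LB_TO_KG` is introduced here for illustration.

```python
import pandas as pd

LB_TO_KG = 0.45359237
weights_lb = pd.Series([3504, 3436])

# astype(int) simply drops the fractional part.
truncated = (weights_lb * LB_TO_KG).astype(int)
# round() first gives the nearest whole kilogram.
rounded = (weights_lb * LB_TO_KG).round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```

For 3436 lb (about 1558.54 kg) the two approaches differ by one kilogram, so it is worth choosing deliberately which behaviour is wanted.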


### Field Transformation & Preprocessing

The data fed into a machine learning model rarely bears much similarity to the data that the data scientist originally received. One common transformation is to normalize the inputs.  A normalization allows numbers to be put in a standard form so that two values can easily be compared.  Consider if a friend told you that he received a $10 discount.  Is this a good deal?  Maybe.  But the value is not normalized.  If your friend purchased a car, then the discount is not that good.  If your friend purchased dinner, this is a very good discount!

Percentages are a very common form of normalization.  If your friend tells you they got 10% off, we know that this is a better discount than 5%.  It does not matter how much the purchase price was.  One very common machine learning normalization is the Z-Score:

$z = {x- \mu \over \sigma} $

To calculate the Z-Score you need to also calculate the mean ($\mu$) and the standard deviation ($\sigma$).  The mean is calculated as follows:

$\mu = \bar{x} = \frac{x_1+x_2+\cdots +x_N}{N}$

The standard deviation is calculated as follows:

$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}, {\rm \ \ where\ \ } \mu = \frac{1}{N} \sum_{i=1}^N x_i$

The following Python code replaces the mpg with a z-score.  Cars with average MPG will be near zero, above zero is above average, and below zero is below average.  Z-scores above 3 or below -3 are very rare; such values are treated as outliers.

In [None]:
import os
import pandas as pd
from scipy.stats import zscore

path = "."

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
df['mpg'] = zscore(df['mpg'])
df

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,-0.706439,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,-1.090751,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,-0.706439,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,-0.962647,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,-0.834543,8,302.0,140.0,3449,10.5,70,1,ford torino
5,-1.090751,8,429.0,198.0,4341,10.0,70,1,ford galaxie 500
6,-1.218855,8,454.0,220.0,4354,9.0,70,1,chevrolet impala
7,-1.218855,8,440.0,215.0,4312,8.5,70,1,plymouth fury iii
8,-1.218855,8,455.0,225.0,4425,10.0,70,1,pontiac catalina
9,-1.090751,8,390.0,190.0,3850,8.5,70,1,amc ambassador dpl
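The z-score can also be computed directly from the formulas given earlier, which makes the calculation explicit. This is a minimal sketch on a short made-up series. It passes `ddof=0` so that the standard deviation divides by $N$ as in the formula; `scipy.stats.zscore` uses the same convention by default, whereas the pandas `std()` default is `ddof=1`.

```python
import pandas as pd

# Short stand-in series (illustrative values).
mpg = pd.Series([18.0, 15.0, 18.0, 16.0, 17.0])

mu = mpg.mean()
sigma = mpg.std(ddof=0)   # ddof=0 divides by N, matching the formula for sigma
z = (mpg - mu) / sigma

print(z)
```

After the transformation the values have a mean of zero and a standard deviation of one, up to floating-point error.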


### Missing Values

Missing values are a reality of machine learning.  Ideally every row of data will have values for all columns.  However, this is rarely the case.  Most of the values are present in the MPG database.  However, there are missing values in the horsepower column.  A common practice is to replace missing values with the median value for that column.  The median is calculated as described [here](https://www.mathsisfun.com/median.html).  The following code replaces any NA values in horsepower with the median:

In [None]:
import os
import pandas as pd

path = "."

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
med = df['horsepower'].median()
df['horsepower'] = df['horsepower'].fillna(med)
# df = df.dropna() # you can also simply drop NA values
print(f"horsepower has na? {pd.isnull(df['horsepower']).values.any()}")

horsepower has na? False
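Before filling missing values it is often useful to see how many there are in each column. The following sketch uses a small made-up dataframe with two missing horsepower values; the numbers are illustrative.

```python
import numpy as np
import pandas as pd

# Small stand-in dataframe with missing values (illustrative).
df = pd.DataFrame({'mpg': [18.0, 25.0, 16.0, 21.0],
                   'horsepower': [130.0, np.nan, 150.0, np.nan]})

missing_before = df.isna().sum()   # count of missing values per column
print(missing_before)

med = df['horsepower'].median()    # median() skips NaN by default
df['horsepower'] = df['horsepower'].fillna(med)

print(df)
```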


### Concatenating Rows and Columns

Rows and columns can be concatenated together to form new data frames.

In [None]:
# Create a new dataframe from name and horsepower

import os
import pandas as pd

path = "."

filename_read = os.path.join(path, "auto-mpg.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])
col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name, col_horsepower], axis=1)
result

Unnamed: 0,name,horsepower
0,chevrolet chevelle malibu,130.0
1,buick skylark 320,165.0
2,plymouth satellite,150.0
3,amc rebel sst,150.0
4,ford torino,140.0
5,ford galaxie 500,198.0
6,chevrolet impala,220.0
7,plymouth fury iii,215.0
8,pontiac catalina,225.0
9,amc ambassador dpl,190.0
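The example above concatenates columns (`axis=1`). Rows can be concatenated in the same way with `axis=0`. The sketch below stacks the first two and last two rows of a small made-up dataframe; the values are illustrative.

```python
import pandas as pd

# Small stand-in dataframe (illustrative values).
df = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                   'horsepower': [130.0, 165.0, 150.0, 150.0, 140.0]})

first_two = df[0:2]
last_two = df[-2:]

# axis=0 stacks the pieces on top of each other.
result = pd.concat([first_two, last_two], axis=0)
print(result)
```

The original row labels are preserved; pass `ignore_index=True` to `pd.concat` if a fresh 0-based index is preferred.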


### Accessing Files directly

It is possible to access files directly rather than through Pandas.  Using Python's `csv` package, you can read a file line by line and process each row as it arrives, which allows you to handle very large files that would not fit into memory.  For the purposes of this class, all files will fit into memory, and you should use Pandas for all class assignments.

In [None]:
# Read a raw text file (avoid this)
import codecs
import os

path = "."

# Always specify your encoding! There is no such thing as "it's just a text file".
# See... http://www.joelonsoftware.com/articles/Unicode.html
# Also see... http://www.utf8everywhere.org/
encoding = 'utf-8'
filename = os.path.join(path, "auto-mpg.csv")

c = 0

with codecs.open(filename, "r", encoding) as fh:
    # Iterate over this line by line...
    for line in fh:
        c += 1 # Only the first 5 lines
        if c > 5:
            break
        print(line.strip())

mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
18,8,307,130,3504,12,70,1,chevrolet chevelle malibu
15,8,350,165,3693,11.5,70,1,buick skylark 320
18,8,318,150,3436,11,70,1,plymouth satellite
16,8,304,150,3433,12,70,1,amc rebel sst


In [None]:
# Read a CSV file
import codecs
import os
import csv

encoding = 'utf-8'
path = "."
filename = os.path.join(path, "auto-mpg.csv")

c = 0

with codecs.open(filename, "r", encoding) as fh:
    reader = csv.reader(fh)
    for row in reader:
        c += 1
        if c > 5:
            break
        print(row)

['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name']
['18', '8', '307', '130', '3504', '12', '70', '1', 'chevrolet chevelle malibu']
['15', '8', '350', '165', '3693', '11.5', '70', '1', 'buick skylark 320']
['18', '8', '318', '150', '3436', '11', '70', '1', 'plymouth satellite']
['16', '8', '304', '150', '3433', '12', '70', '1', 'amc rebel sst']


In [None]:
# Read a CSV, symbolic headers
import codecs
import os
import csv

path = "."

encoding = 'utf-8'
filename = os.path.join(path, "auto-mpg.csv")

c = 0

with codecs.open(filename, "r", encoding) as fh:
    reader = csv.reader(fh)

    # Generate header index using comprehension.
    # Comprehension is cool, but not necessarily a beginner's feature of Python.
    header_idx = {key: value for (value, key) in enumerate(next(reader))}

    for row in reader:
        c += 1
        if c > 5:
            break
        print(f"Car Name: {row[header_idx['name']]}")

Car Name: chevrolet chevelle malibu
Car Name: buick skylark 320
Car Name: plymouth satellite
Car Name: amc rebel sst
Car Name: ford torino
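The standard library can also build the header lookup for you. The sketch below uses `csv.DictReader`, which treats the first row as field names and returns each row as a dictionary. To stay self-contained it reads from an in-memory string rather than `auto-mpg.csv`; the two data rows are copied from the output shown earlier, with fewer columns.

```python
import csv
import io

# In-memory stand-in for the file (reduced to three columns).
csv_text = (
    "mpg,horsepower,name\n"
    "18,130,chevrolet chevelle malibu\n"
    "15,165,buick skylark 320\n"
)

names = []
with io.StringIO(csv_text) as fh:
    reader = csv.DictReader(fh)   # the first row supplies the field names
    for row in reader:
        names.append(row['name'])
        print(f"Car Name: {row['name']}")
```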


In [None]:
# Read a CSV, manual stats
import codecs
import os
import csv
import math

path = "."

encoding = 'utf-8'
filename_read = os.path.join(path, "auto-mpg.csv")
filename_write = os.path.join(path, "auto-mpg-norm.csv")

c = 0

with codecs.open(filename_read, "r", encoding) as fh:
    reader = csv.reader(fh)

    # Generate header index using comprehension.
    # Comprehension is cool, but not necessarily a beginner's feature of Python.
    header_idx = {key: value for (value, key) in enumerate(next(reader))}
    headers = header_idx.keys()

    fields = {key: value for (key, value) in [(key, {'count':0, 'sum':0, 'variance':0}) for key in headers]}

    # Pass 1, means
    row_count = 0
    for row in reader:
        row_count += 1
        for name in headers:
            try:
                value = float(row[header_idx[name]])
                field = fields[name]
                field['count'] += 1
                field['sum'] += value
            except ValueError:
                pass

    # Calculate means, toss sums (part of pass 1)
    for field in fields.values():
        # If 90% are not missing (or non-numeric) calculate a mean
        if (field['count'] / row_count) > 0.9:
            field['mean'] = field['sum'] / field['count']
            del field['sum']

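    # Pass 2, standard deviation & variance
    # Rewind the file. The header row is read again below, but float() raises
    # ValueError on its text, so it is skipped like any other non-numeric value.
    fh.seek(0)
    for row in reader: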
        for name in headers:
            try:
                value = float(row[header_idx[name]])
                field = fields[name]
                # If we failed to calculate a mean, no variance.
                if 'mean' in field:
                    field['variance'] += (value - field['mean'])**2
            except ValueError:
                pass

    # Calculate standard deviation, keep variance (part of pass 2)
    for field in fields.values():
        # If no variance, then no standard deviation
        if 'mean' in field:
            field['variance'] /= field['count']
            field['sdev'] = math.sqrt(field['variance'])
        else:
            del field['variance']

    # Print summary stats
    for key in sorted(fields.keys()):
        print(f"{key}:{fields[key]}")

acceleration:{'count': 398, 'variance': 7.585740574732961, 'mean': 15.568090452261291, 'sdev': 2.7542223175940177}
cylinders:{'count': 398, 'variance': 2.8861455518799946, 'mean': 5.454773869346734, 'sdev': 1.698865960539558}
displacement:{'count': 398, 'variance': 10844.882068950259, 'mean': 193.42587939698493, 'sdev': 104.13876352708563}
horsepower:{'count': 392, 'variance': 1477.7898792169979, 'mean': 104.46938775510205, 'sdev': 38.442032714425984}
mpg:{'count': 398, 'variance': 60.93611928991693, 'mean': 23.514572864321615, 'sdev': 7.806159061274433}
name:{'count': 0, 'sum': 0}
origin:{'count': 398, 'variance': 0.6416757152597181, 'mean': 1.5728643216080402, 'sdev': 0.801046637381194}
weight:{'count': 398, 'variance': 715339.1287404363, 'mean': 2970.424623115578, 'sdev': 845.7772335198177}
year:{'count': 398, 'variance': 13.638089947223559, 'mean': 76.01005025125629, 'sdev': 3.6929784655780975}
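The variances and standard deviations printed here are slightly smaller than the ones Pandas reported earlier in this tutorial. The manual code divides the sum of squared differences by the count $N$ (the population variance), whereas the Pandas `var()` and `std()` methods divide by $N - 1$ by default (the sample variance). The sketch below shows both conventions on a short made-up list using Python's `statistics` module.

```python
import statistics

# Short illustrative list of values.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

population_var = statistics.pvariance(values)  # divides by N
sample_var = statistics.variance(values)       # divides by N - 1

print(f"Population variance: {population_var}")
print(f"Sample variance:     {sample_var}")

# The two differ only by the factor (N - 1) / N.
n = len(values)
print(f"Rescaled sample variance: {sample_var * (n - 1) / n}")
```

With 398 rows the factor is 397/398, which accounts for the small differences between the two sets of results.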


# Exercises

For these exercises, you should use **Spyder** as your IDE.  


**Question 1** Write Python code to produce the small multiplication square below.  You might want to refer to the documentation at https://docs.python.org/3/index.html for `print`, and consider using its `end` argument and the string method `ljust`.

<table>
<tr>
    <td>1</td> <td>2</td> <td>3</td>  <td>4</td>  <td>5</td>
</tr>
<tr>
    <td>2</td>  <td>4</td>  <td>6</td>  <td>8</td>  <td>10</td>
</tr>
<tr>
    <td>3</td>  <td>6</td>  <td>9</td>  <td>12</td> <td>15</td>
</tr>
<tr>
    <td>4</td>  <td>8</td>  <td>12</td> <td>16</td> <td>20</td>
</tr>
<tr>
    <td>5</td>  <td>10</td> <td>15</td> <td>20</td> <td>25</td>
</tr>
</table>
<p></p>
Now adapt this so that when an even number is found, a 0 is printed.<br>
<table>
<tr>
    <td>1</td> <td>0</td> <td>3</td>  <td>0</td>  <td>5</td>
</tr>
<tr>
    <td>0</td>  <td>0</td>  <td>0</td>  <td>0</td>  <td>0</td>
</tr>
<tr>
    <td>3</td>  <td>0</td>  <td>9</td>  <td>0</td> <td>15</td>
</tr>
<tr>
    <td>0</td>  <td>0</td>  <td>0</td> <td>0</td> <td>0</td>
</tr>
<tr>
    <td>5</td>  <td>0</td> <td>15</td> <td>0</td> <td>25</td>
</tr>
</table>

**Question 2** The iris dataset is a machine learning classic and will recur in the lectures and in these tutorials.  It contains 150 data points, 50 for each of three kinds of iris, and consists of data relating to their flowers.  Similarly to the examples in the tutorial above:

- Load the dataset
- Print the dataset
- Print the rows in the dataset where petal_w < 1.0
- Sort the dataset by sepal_l and print this
- Save the sorted dataset to a new file
- Calculate the variance of sepal_l