<h1 align='center'> COMP2420/COMP6420 - Introduction to Data Management, Analysis and Security</h1>

<h2 align='center'> Lab 02 - NumPy, Pandas, and other Python Packages</h2>

*****

## Aim
Our aims in this lab are to:
- Become familiar with the Python packages NumPy & Pandas.
- Learn some basic functionality of these packages for future use.

*****

## Learning Outcomes
There are no distinct linkages to the learning outcomes of the course, as this lab forms the baseline for using additional Python packages and other tools to achieve the learning outcomes described on [Programs & Courses](https://programsandcourses.anu.edu.au/course/COMP2420#learning-outcomes) in the coming weeks.

***

## Preparation

Before starting this lab, we suggest you complete the following:
- Watch the **Data Science Tools** lecture; it really will help!

<br>

The following documents will be useful for this lab:

- [NumPy Cheatsheet](./helpManuals/Numpy_CheatSheet.pdf)
- [Pandas Cheatsheet pt1](./helpManuals/Pandas_CheatSheet_1.pdf)
- [Pandas Cheatsheet pt2](./helpManuals/Pandas_CheatSheet_2.pdf)

## Detour: Introduction to Numpy & Pandas

In Lab01, we covered the basics of Python (among other items). However, Python by itself is limited in its functionality. Luckily, Python's development community is rich, and a number of optional packages exist to provide the functionality we require. Two of the most widely used packages (and the two we will be covering) are **NumPy** and **Pandas**. In a 2018 survey of data scientists by [Figure-Eight](https://www.figure-eight.com), the most popular machine learning frameworks identified by respondents were Pandas and NumPy. Naturally, they seem like a good place to start on your journey into data science.

**NumPy**, initially released in 2006 for Python 2.x, and 2011 for Python 3.x, provides fast mathematical computation on arrays and matrices. Optimised using underlying functionality written in the _C_ programming language, NumPy provides much faster implementations of array functionality than pure Python. Since arrays and matrices are an essential part of machine learning, NumPy is commonly used alongside other packages in that ecosystem such as SciPy, Pandas and Matplotlib (to name a few).

**Pandas** is another widely used package for data analysis in the Python language. Originally written for performing quantitative analysis on financial data by a US-based investment firm, Pandas provides easy-to-use and highly stable data manipulation & analysis tools. Once again optimised using the C programming language, Pandas provides a much faster implementation of two-dimensional array-style objects, along with simple function calls for analysis.

We define packages such as **NumPy** and **Pandas** as additional (optional) packages for Python as they are not included in the standard Python installation. This is one of the reasons why we use the _Anaconda_ distribution as our Python package manager, as it installs these items by default. Without this, there would be additional installs necessary through the _pip_ package manager (not covered) to ensure we had all the right tools. Long story short, be glad we are using Anaconda where everything is ready to go!

Before getting into the lab, let's run a quick little experiment on the C vs Python implementation debate.

### Detour: Nothing on NumPy

The following are two examples of adding two lists, each with 10 million items. We take the mean of 5 runs, as this gives a better indication of the time taken to perform this action; otherwise, other processes on your computer might skew the results slightly.

Notice that the NumPy implementation is much faster than the traditional Python implementation. Maybe those "C purists" are onto something...

In [74]:
# Module Imports 
# Notice how we follow the same setup as shown in Lab01
# Without this, the Python environment does not know these items exist
# These packages are usually optional for Python, 
#     but are installed in the Anaconda distribution
import numpy as np 
import pandas as pd

# Including extra packages for the sake of the experiment.
# These packages are inbuilt to the standard Python distribution
from timeit import default_timer as timer
import statistics as st

# Ignoring warnings for now.
# This is because NumPy uses some deprecated functionality that is outside of
#     our control
import warnings
warnings.filterwarnings("ignore")

In [75]:
# A function to determine the time taken to generate & add 
#     two lists of 10 million items.
# The action of addition is performed a number of times to gain an average.
def trad_listaddition():
    times = []
    for x in range(5):
        start = timer()
        # range() returns a lazy range object, so we unpack it into a list
        #     before the element-wise X+Y addition
        X = [*range(10000000)] 
        Y = [*range(10000000)]
        Z = [X[i] + Y[i] for i in range(10000000)]
        end = timer()
        times.append(end - start)
    return round(st.mean(times), 3)

def numpy_listaddition():
    times = np.zeros([5]) # Notice we have to define the list size at the start
    for x in range(5):
        start = timer()
        # np.arange() returns an ndarray, so no extra unpacking necessary
        X = np.arange(10000000) 
        Y = np.arange(10000000)
        Z = np.add(X,Y)
        end = timer()
        times[x] = end - start
    return np.round(np.mean(times), decimals=3)

# Notice we can specify a variable in our print function in a number of ways
print("The pure Python method takes %s seconds" % trad_listaddition())
print("The NumPy method takes", numpy_listaddition(), "seconds")

The pure Python method takes 2.677 seconds
The NumPy method takes 0.065 seconds


We don't even need to hardcode the experiment to ensure that NumPy will be faster on your computer, because it will be. Feel free to run the above code multiple times to prove the point!
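
The averaging done by hand above can also be sketched with the standard library's `timeit.repeat`, which runs a callable several times and returns the individual timings. The size below is reduced to one million items so the sketch runs quickly, and the helper names `py_add` and `np_add` are illustrative rather than part of the lab code.

```python
import statistics as st
import timeit

import numpy as np

N = 1_000_000  # smaller than the lab's 10 million so this runs quickly


def py_add():
    # Build two Python lists and add them element by element
    X = list(range(N))
    Y = list(range(N))
    return [x + y for x, y in zip(X, Y)]


def np_add():
    # Build two ndarrays and add them in one vectorised call
    X = np.arange(N)
    Y = np.arange(N)
    return np.add(X, Y)


# repeat=5 gives five separate timings; number=1 runs the callable once per timing
py_times = timeit.repeat(py_add, repeat=5, number=1)
np_times = timeit.repeat(np_add, repeat=5, number=1)

print("Pure Python mean:", round(st.mean(py_times), 3), "seconds")
print("NumPy mean:", round(st.mean(np_times), 3), "seconds")
```

The exact numbers will differ between machines, so no expected output is shown here.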
*****

## Question 1: **It's About Time!**

Taking inspiration from the little speedtest we have above, the following exercise is designed to showcase how the C optimised code of NumPy will show vanilla Python who is boss, while teaching you some new functions in the process.

### Detour: Why do we care about time?

It may seem pedantic to consider the time taken by smaller functions, but everything adds up. While theoretical computational complexity (such as "Big O" notation, covered in COMP1110, COMP3600 and other courses) is important, there are further considerations to be made when implementing programs to ensure the most efficient code is being written. Sometimes that means choosing between programming languages to avoid compilers/interpreters and abstractions between human-readable code and machine code; in our case, it means choosing the best packages within a single language.

**Consider the example of entity resolution in datasets of over 10 million records.** For those not familiar with entity resolution, [Data Community DC](http://www.datacommunitydc.org/blog/2013/08/entity-resolution-for-big-data) provides a short description:
> Entity Resolution is the task of disambiguating manifestations of real world entities in various records or mentions by linking and grouping. For example, there could be different ways of addressing the same person in text, different addresses for businesses, or photos of a particular object. This clearly has many applications, particularly in government and public health data, web search, comparison shopping, law enforcement, and more.

Implementing an entity resolution algorithm would require multiple passes over each applicable dataset. During each pass, every entry would need to be compared against every other entry in the dataset to check for duplicates, and then for similarity despite spelling errors, missing data, etc. While this comparison is necessary, it would be wise to find smarter ways than a brute-force approach of traversing each list multiple times. Otherwise your algorithm might not finish this year! Another fight against time is that new data would be constantly added to the list, meaning that without a fast way of performing the calculation you would forever be behind, and end users requiring up-to-date information would suffer.

So in short, it is always about time.
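
To make the cost of a brute-force comparison concrete, the sketch below finds exact duplicates in a small list of records in two ways: comparing every pair, and using a set to remember what has already been seen. It only handles exact matches (real entity resolution must also cope with spelling errors and missing data), and the function names and sample records are made up for illustration.

```python
def duplicates_bruteforce(records):
    # Compare every record with every later record: about n*(n-1)/2 comparisons
    found = set()
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if records[i] == records[j]:
                found.add(records[i])
    return found


def duplicates_with_set(records):
    # One pass over the data: a set lookup takes roughly constant time on average
    seen = set()
    found = set()
    for record in records:
        if record in seen:
            found.add(record)
        else:
            seen.add(record)
    return found


# Hypothetical sample records
people = ["ann lee", "bob ray", "ann lee", "cy doe", "bob ray", "di fox"]
print(duplicates_bruteforce(people))
print(duplicates_with_set(people))
```

With 10 million records, the pairwise version needs on the order of 5 × 10^13 comparisons, while the single pass needs about 10 million lookups.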

For those interested (and as an extension), a number of ANU researchers have worked in this field and this paper is worth a read: [Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases](https://arxiv.org/pdf/1712.09691.pdf)

****

### Q1.1: Arranging a Flip
First things first. Filling in the functions (where pyflip is pure python, and npflip is pure numpy), perform the following actions (they can be performed in the same step):
- Produce a list of 50 million elements.
- Sort the list.
- Reverse the list.

We will provide the timing code in the functions for this question, but afterwards it is up to you to include this. If you have run the timing tests above, you should already have the relevant packages imported. Otherwise, you will need to go back and run these.

In [14]:
def pyflip():
    times = []
    for x in range(10):
        start = timer()
        ##############################################
        # Enter Code Here
        X = [*range(50000000)] 
        X.sort(reverse = True)
        end = timer()
        times.append(end - start)
    return round(st.mean(times), 3)

def npflip():
    times = np.zeros([10]) # Notice we have to define the list size at the start
    for x in range(10):
        start = timer()
        ##############################################
        # Enter Code Here
        X = np.arange(50000000)
        # np.flip(X) -> reverse
        (np.sort(X))[::-1]
        end = timer()
        times[x] = end - start
    return np.round(np.mean(times), decimals=3)


print("The pure Python method takes %s seconds" % pyflip())
print("The NumPy method takes %s seconds" % npflip())

The pure Python method takes 2.682 seconds
The NumPy method takes 0.956 seconds
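
As a small-scale check of the sort-then-reverse step, the sketch below uses a short made-up array to show that a negative-step slice and `np.flip` give the same descending result, alongside the list equivalent. The variable names are illustrative only.

```python
import numpy as np

values = np.array([3, 1, 4, 1, 5, 9, 2, 6])

# np.sort returns a sorted copy; the negative-step slice then reverses it
desc_slice = np.sort(values)[::-1]

# np.flip reverses the order of the elements
desc_flip = np.flip(np.sort(values))

# Pure Python equivalent on a list
desc_list = sorted(values.tolist(), reverse=True)

print(desc_slice)
print(np.array_equal(desc_slice, desc_flip))
print(desc_list)
```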


<br>

### Q1.2: Degrees of Heat
Time to include a bit of math. Perform the following:
- Based on the input of an array of 5 million elements, convert every element of the array from Celsius (°C) to Fahrenheit (°F)

Hint: The equations are shown below:

<img src="./img/fcform.png" alt="Formulas" style="width: 200px;"/>

In [27]:
import random as rn

def pyDegree(pyInput):
    ##############################################
    # Enter Code Here
    # Don't forget timing code
    start = timer()
    result = [((x*9/5)+32) for x in pyInput]
    end = timer()
    return round(end-start, 3)

def npDegree(npInput):
    ##############################################
    # Enter Code Here
    # Don't forget timing code!
    start = timer()
    result = (npInput *9/5)+32
    end = timer()
    return np.round(end-start, decimals=3)

# Note inPy is not performed the fastest way, but is also not being counted in your timing.
inPy = [rn.randint(-25,50) for x in range(5000000)]
# np.random.randint excludes the upper bound, so 51 matches rn.randint(-25, 50)
inNp = np.random.randint(-25, 51, size=5000000)


print("The pure Python method takes %s seconds" % pyDegree(inPy))
print("The NumPy method takes %s seconds" % npDegree(inNp))

The pure Python method takes 1.156 seconds
The NumPy method takes 0.048 seconds
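
Before timing anything, it can help to check the conversion on a few temperatures whose Fahrenheit values are well known (for instance, -40 is the same on both scales). The sketch below applies F = C × 9/5 + 32 to a short hand-picked list in both styles and confirms that they agree.

```python
import numpy as np

celsius = [-40, 0, 37, 100]

# Pure Python: apply F = C * 9/5 + 32 to each element in turn
fahrenheit_py = [(c * 9 / 5) + 32 for c in celsius]

# NumPy: the same arithmetic is applied to the whole array at once
fahrenheit_np = (np.array(celsius) * 9 / 5) + 32

print(fahrenheit_py)
print(np.allclose(fahrenheit_py, fahrenheit_np))
```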


<br>

### Q1.3: Meticulous Matching
Taking inspiration from the entity resolution discussion above, you have two tasks:

#### Q1.3.1
- From 2 input arrays (each containing 1,000,000 elements), find the indices at which both arrays hold the same value.

For example:
```Python
a = [1,2,3,4]
b = [1,3,5,4]
results = matchpositions(a,b)
print("The matching positions are: ", results) 
```
And the output would be `[0,3]`, because at positions 0 and 3 of both arrays the entries are the same.
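
The example above calls `matchpositions` without defining it. One possible body is sketched below in both styles; `matchpositions_np` is a hypothetical name, and neither version includes the timing code that the lab functions return alongside the result.

```python
import numpy as np


def matchpositions(a, b):
    # Pure Python: walk both lists together and keep the indices where they agree
    return [i for i, (x, y) in enumerate(zip(a, b)) if x == y]


def matchpositions_np(a, b):
    # NumPy: a == b gives a boolean array; flatnonzero returns the indices of True entries
    return np.flatnonzero(np.asarray(a) == np.asarray(b))


a = [1, 2, 3, 4]
b = [1, 3, 5, 4]
print("The matching positions are: ", matchpositions(a, b))
print("The matching positions are: ", matchpositions_np(a, b))
```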

A test case has been provided to ensure your code is performing correctly, while the timing case is held separately so you can determine which is faster.

In [70]:
def PyMatch(listA, listB):
    times = []
    for x in range(10):
        result = []
        start = timer()
        ##############################################
        # Enter Code Here
        # Use a separate index name so the outer timing loop variable is not shadowed
        for i in range(len(listA)):
            if listA[i] == listB[i]:
                result.append(i)
        end = timer()
        times.append(end - start)
    return (round(st.mean(times), 3), result)
# note that we return the time, and the list of matching locations as a tuple

def NpMatch(nlistA, nlistB):
    times = np.zeros([10]) # Notice we have to define the list size at the start
    for x in range(10):
        result = []
        start = timer()
        ##############################################
        # Enter Code Here
        result = np.where((nlistA== nlistB))[0]
        end = timer()
        times[x] = end - start
    return (np.round(np.mean(times), decimals=3), result)
# note that we return the time, and the list of matching locations as a tuple

In [71]:
# Test Case

# Python Test
testAPy = [1,2,3,4,5,6,7]
testBPy = [1,3,5,7,9,2,7]
resultPy = PyMatch(testAPy, testBPy)
if resultPy[1] == [0,6]:
    print("Test Passes - Python")
    print("You found: ", resultPy[1])
else:
    print("Test Failed - Python")
    print("You found: ", resultPy[1])
    
# Numpy Test
testANp = np.array([1,2,3,4,5,6,7])
testBNp = np.array([1,3,5,7,9,2,7])
resultNp = NpMatch(testANp, testBNp)
if np.array_equal(resultNp[1], np.array([0,6])):
    print("Test Passes - Numpy")
    print("You found: ", resultNp[1])
else:
    print("Test Failed - Numpy")
    print("You found: ", resultNp[1])

Test Passes - Python
You found:  [0, 6]
Test Passes - Numpy
You found:  [0 6]


In [72]:
# Timing Case
# This doesn't check if your code is correct, so no cheating!

# Note: Python generation is not performed the fastest way, but is also not being counted in your timing.
inAPy = [rn.randint(0,20) for x in range(10000)]
inBPy = [rn.randint(0,20) for y in range(10000)]

# np.random.randint excludes the upper bound, so 21 matches rn.randint(0, 20)
inANp = np.random.randint(0, 21, size=10000)
inBNp = np.random.randint(0, 21, size=10000)

print("The pure Python method takes %s seconds" % (PyMatch(inAPy, inBPy))[0])
print("The NumPy method takes %s seconds" % (NpMatch(inANp, inBNp))[0])

The pure Python method takes 0.001 seconds
The NumPy method takes 0.0 seconds


<br>

#### Q1.3.2
- From 2 input arrays (each containing 100,000 elements), find the values that appear in both arrays and return the pairs of positions at which they occur.

For example:
```Python
a = [1,2,3,4]
b = [1,3,5,4]
results = matchexist(a,b)
print("The matching positions are: ", results) 
```
And the output would be `[(0,0), (2,1), (3,3)]`, where the first element in each tuple is the position in the first array, and the second element in each tuple is the position in the second array. In the above example, `(0,0)` refers to `a[0]` and `b[0]` matching, and so on.
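
One way to avoid comparing every element of one array with every element of the other is sketched below: a dictionary lookup in pure Python, and `np.intersect1d` with `return_indices=True` in NumPy. Both versions assume the values within each input are unique, `matchexist_np` is a hypothetical name, and neither includes the timing code that the lab functions are expected to return.

```python
import numpy as np


def matchexist(a, b):
    # Map each value in b to its position, then look up every value of a.
    # Assumes the values within each list are unique.
    position_in_b = {value: j for j, value in enumerate(b)}
    return [(i, position_in_b[value]) for i, value in enumerate(a) if value in position_in_b]


def matchexist_np(a, b):
    # With return_indices=True, intersect1d also reports where each common value
    # sits in a and in b (the first occurrence if a value repeats).
    _, idx_a, idx_b = np.intersect1d(a, b, return_indices=True)
    order = np.argsort(idx_a)  # report the pairs in order of position in a
    return list(zip(idx_a[order].tolist(), idx_b[order].tolist()))


a = [1, 2, 3, 4]
b = [1, 3, 5, 4]
print("The matching positions are: ", matchexist(a, b))
print("The matching positions are: ", matchexist_np(np.array(a), np.array(b)))
```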

A test case has been provided to ensure your code is performing correctly, while the timing case is held separately so you can determine which is faster.

In [None]:
def PyPairs(listA, listB):
    ##############################################
    # Enter Code Here
    # Don't forget the timing code!
    return

## THIS IS HARD TO DO IN PURE NUMPY WELL. 
# CONSIDER COMING BACK AFTER OTHER ITEMS
def NpPairs(nlistA, nlistB):
    ##############################################
    # Enter Code Here
    # Don't forget timing code!
    return

In [None]:
# Test Case

# Python Test
pairAPy = [1,2,3,4]
pairBPy = [1,5,7,2]
resultpairPy = PyPairs(pairAPy, pairBPy)
if resultpairPy[1] == [(0,0), (1,3)]:
    print("Test Passes - Python")
    print("You found: ", resultpairPy[1])
else:
    print("Test Failed - Python")
    print("You found: ", resultpairPy[1])
    
# Numpy Test
pairANp = np.array([1,2,3,4])
pairBNp = np.array([1,5,8,2])
resultpairNp = NpPairs(pairANp, pairBNp)
if np.array_equal(resultpairNp[1], np.array([(0,0), (1,3)])):
    print("Test Passes - Numpy")
    print("You found: ", resultpairNp[1])
else:
    print("Test Failed - Numpy")
    print("You found: ", resultpairNp[1])

In [None]:
# Timing Case
# This doesn't check if your code is correct, so no cheating!

inAPy = rn.sample(range(100000), 5000)
inBPy = rn.sample(range(100000), 5000)

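# Convert the sampled lists to ndarrays so the NumPy version receives NumPy input
inANp = np.array(rn.sample(range(100000), 5000))
inBNp = np.array(rn.sample(range(100000), 5000))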

print("The pure Python method takes %s seconds" % PyPairs(inAPy, inBPy)[0])
print("The NumPy method takes %s seconds" % NpPairs(inANp, inBNp)[0])

****
## Question 2: **Fascinating Flowers**

As we are just starting on a data science path, we will start with one of the most famous datasets for beginners, the _Iris Flower Dataset_. Published by [Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher) in 1936, the dataset has become a common test case for statistical machine learning methods. However today, we will be using it for learning NumPy.

<br>

**Picture this**: You've been successful in applying for an internship at the ANU, under the [Fenner School of Environment and Society](https://fennerschool.anu.edu.au). Known for their interest in the world we live in and famous for their [_What The Fluff?_](https://fennerschool.anu.edu.au/news-events/news/what-fluff) and [_A Buzz About Pollination_](https://fennerschool.anu.edu.au/research/research-stories/buzz-about-pollination-fenner-research-maps-bee-and-food-crops-place) style research pieces, researchers in the school want to know how they can incorporate data analytics more into their research. Being the genius you are, you decide that you will write code to demonstrate the strengths of data analysis, on a dataset they will take interest in: the Iris Flower Dataset.

Your task is to showcase how NumPy can be used for manipulating data and providing _meaningful_ insight into the data a user is interacting with, using the Iris Flower Dataset as your example dataset.

The dataset features the following information:

| Column Name    | Description                      |
| :------------: | :------------:                   |
| sepal_length   | Length of the Sepal (end to end) |
| sepal_width    | Width of the Sepal (end to end)  |
| petal_length   | Length of the Petal (end to end) |
| petal_width    | Width of the Petal (end to end)  |
| Species        | Type of Iris Flower              |

For those who are unsure what the table columns refer to, the sepal and petal are outlined on the diagram below.

<img src="./img/diagramSepal.jpg" alt="Flower Diagram" style="width: 400px;"/>


****

### Q2.1: Importing Data & Making It Useable

There are a number of standard actions that will need to be performed to get the data to a "useable" point, then we can start experimenting with the data. The following actions will get the data to a "useable" point:
- Importing the data into a 2D NumPy array from the csv (csv location: `./data/IRIS.csv`)
- Reducing the data to only make use of the `sepal_length`, `sepal_width`, `petal_length` & `petal_width` columns. We will use the species column in later labs.

The NumPy function `genfromtxt` will assist you greatly in being able to perform the above actions. [Documentation is here](https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html)
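
As a minimal sketch of how the `genfromtxt` arguments fit together, the example below reads a three-row in-memory sample instead of the real file. The header names and the sample rows are assumptions made for illustration; the actual layout of `./data/IRIS.csv` may differ.

```python
import io

import numpy as np

# Hypothetical three-row sample using the column layout from the table above
sample_csv = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

data = np.genfromtxt(
    sample_csv,
    delimiter=",",         # fields are comma separated
    skip_header=1,         # skip the row of column names
    usecols=(0, 1, 2, 3),  # keep only the four numeric columns
)

print(data.shape)
print(data[0])
```

Leaving the text column out through `usecols` keeps the result a plain 2D array of floats.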

In [109]:
# Your Code Here
file = np.genfromtxt("./data/IRIS.csv", delimiter =',', skip_header =1, usecols=(0,1,2,3))
file

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [None]:
#array[row_start:row_end, col_start:col_end]

In [110]:
file[0:9,:]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2]])

<br>

### Q2.2: Finding Meaning
Now that we have some data that can be used, we can start finding statistical information in the NumPy data structure. Using NumPy functions, we want to find the statistical information that exists for the length and widths of the sepal and petal for the observed flowers.

With the knowledge of the first 50 entries in the dataset being the flower type `Iris-setosa`, entries 51-100 being `Iris-versicolor` and the last 50 entries being `Iris-virginica`, we want to determine the relative sizes of the entire Iris family, and the unique aspects of each Iris flower. This includes the following actions:
- Determining the mean, minimum and maximum Sepal and Petal dimensions for the entire Iris family
- Determining the mean, minimum and maximum Sepal and Petal dimensions for each Iris sub-species (Setosa, Versicolor and Virginica)
- Determining the standard deviation & variance of the Petal and Sepal dimensions for both the Iris family as a whole, and each sub-species.

With this information, we should be able to answer questions using the evidence found by interacting with the dataset. Find the metrics specified above and any others that you deem useful, and answer the text questions below.
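
The statistics listed above can be computed for every column at once by passing `axis=0`. The sketch below uses a tiny made-up array rather than the Iris data, so the numbers carry no botanical meaning.

```python
import numpy as np

# Hypothetical measurements: 4 flowers (rows) x 2 dimensions (columns)
data = np.array([
    [5.0, 1.0],
    [6.0, 2.0],
    [7.0, 3.0],
    [8.0, 6.0],
])

# axis=0 collapses the rows, giving one statistic per column
print("mean:", data.mean(axis=0))
print("min: ", data.min(axis=0))
print("max: ", data.max(axis=0))
print("std: ", data.std(axis=0))  # population standard deviation (ddof=0) by default
print("var: ", data.var(axis=0))

# Slicing the rows first restricts the statistics to one group
first_group = data[:2]
print("first group mean:", first_group.mean(axis=0))
```

For the Iris array, the same row slicing separates the three sub-species: the first 50 rows, rows 50 to 99, and the final 50 rows.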

In [118]:
# Your Code Here
entire_min = [np.min(file[:,0]), np.min(file[:,1]), np.min(file[:,2]),np.min(file[:,3])]
entire_max = [np.max(file[:,0]), np.max(file[:,1]), np.max(file[:,2]),np.max(file[:,3])]
entire_mean = [np.mean(file[:,0]), np.mean(file[:,1]), np.mean(file[:,2]),np.mean(file[:,3])]
print("Entire min, max, mean is: ", entire_min, entire_max, entire_mean)

setosa_splen = file[:50,0]
setosa_spwid = file[:50,1]
setosa_ptlen = file[:50,2]
setosa_ptwid = file[:50,3]
print("Setosa Sepal length minimum is: ",setosa_splen.min())
print("Setosa Sepal length maximum is: ",setosa_splen.max())
print("Setosa Sepal length mean is: ",setosa_splen.mean())
print("Setosa Sepal width minimum is: ",setosa_spwid.min())
print("Setosa Sepal width maximum is: ",setosa_spwid.max())
print("Setosa Sepal width mean is: ",setosa_spwid.mean())

Entire min, max, mean is:  [4.3, 2.0, 1.0, 0.1] [7.9, 4.4, 6.9, 2.5] [5.843333333333334, 3.0540000000000003, 3.758666666666666, 1.1986666666666668]
Setosa Sepal length minimum is:  4.3
Setosa Sepal length maximum is:  5.8
Setosa Sepal length mean is:  5.006
Setosa Sepal width minimum is:  2.3
Setosa Sepal width maximum is:  4.4
Setosa Sepal width mean is:  3.418


#### Q: What is the average petal width and length of a flower in the Iris family?

#### Q: What is the smallest flower (by petal dimensions) observed in the Iris-Setosa sub-species? Does this smaller size correlate to the Sepal dimensions?

#### Q: What is the largest sub-species of the Iris family? How did you determine this?

<br>
While these questions have been directly related to the data and are easily answered, some require a bit less procedural thought and a bit of problem solving. The following is designed to be a precursor to the machine learning work later in the semester.

#### Q: Based on the size dimensions you found above, is there a way that you can categorise a type of flower based on its petal and sepal dimensions? 
##### Provide a short description of how you would perform this, and provide a proof of concept in the form of code with some example cases.

In [None]:
## Your Proof of Concept here

<br>
Finally, remember the original context of this overall question. You were finding meaning in the data to show the Fenner School of Environment and Society how data science can be useful in their research. It would be reasonable to assume that the people you would show this to have varying levels of coding experience, some with no experience whatsoever. Before moving on, make sure your code is readable and commented so that any person could understand what your code is doing.

****
## Question 3: Caring about Cars

Imagine you are trying to help your friends find their ideal dream car, based on the car's performance and fuel economy. In this scenario, we are going to walk step by step through finding an ideal choice of cars under some restrictions given to us. In the future, you will have to come up with these steps yourself or interpret them from an initial briefing, so this is good practice!

The dataset we are using is a dataset of American cars. The dataset has the following schema:

| Column Name    | Description    |
| :------------- | :------------- |
| type           | The car's manufacturer and make/model       |
| mpg            | The car's fuel consumption, in _miles per gallon_ (mpg) |
| cyl            | The number of cylinders in the car's engine | 
| disp           | The combined swept volume of the pistons inside the cylinders of the car's engine |
| hp             | The car's horsepower |
| wt             | The weight of the car, in pounds |
| speed          | A relative measure of the top speed of the car, bespoke format |

We will need to do some alterations of the data to shift some metrics (mainly "mpg", "hp" and "wt") from Americanised metrics to more internationally accepted measurements, then we will be able to perform some analysis to find the ideal car based on some limitations. 

Firstly, let's import the dataset. Using the [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function will be ideal for this. Print out the first 5 rows to ensure the data is held correctly.
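
A minimal sketch of the import-and-inspect step is shown below on an in-memory sample rather than the real file. The sample rows are made up, and the unnamed leading column is an assumption about how `cars.csv` was saved; if the file has such a column, `index_col=0` stops it from appearing as a data column called `Unnamed: 0`.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the leading unnamed column mimics a saved index
sample_csv = io.StringIO(
    ",type,mpg,cyl\n"
    "0,Car A,13.0,8\n"
    "1,Car B,15.0,8\n"
)

# index_col=0 uses the first column as the row index instead of
# loading it as a data column named "Unnamed: 0"
cars = pd.read_csv(sample_csv, index_col=0)

print(cars.head(5))
print(list(cars.columns))
```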

In [122]:
# Your Code Here
import pandas as pd
cars_data = pd.read_csv('./data/cars.csv')

### Question 3.1: Maladapted Metrics
Unfortunately, a lot of datasets have been formed in America (and the UK) and follow their conventions for data metrics such as miles, pounds, and horsepower. Before we work on this dataset, we aim to change that. Your task is as follows:
- Convert the `mpg` column's entries into litres/100km. Create a new column called `l/100km` holding the converted value for each car.
- Convert the `hp` column's entries into kilowatts. Create a new column called `kw` holding the converted value for each car.
- Convert the `wt` column's entries into kilograms. Create a new column called `kg` holding the converted value for each car.

Formulas:
- 235.215 / (x mpg) = y L/100 km
- x hp / 1.341 = y kW
- x lbs / 2.205 = y kg

Further information on the formulas can be found online.
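
As a worked check of the three formulas, the sketch below applies them to a single car using plain arithmetic. The input values are those listed for the AMC Ambassador Brougham in the dataset (13.0 mpg, 175.0 hp, 3821 lbs); the variable names are illustrative.

```python
# Values for one car from the dataset (AMC Ambassador Brougham)
mpg, hp, wt_lbs = 13.0, 175.0, 3821

l_per_100km = 235.215 / mpg  # fuel consumption in litres per 100 km
kw = hp / 1.341              # power in kilowatts
kg = wt_lbs / 2.205          # weight in kilograms

print(round(l_per_100km, 3), round(kw, 3), round(kg, 3))
```

Note that the fuel conversion is a reciprocal: a higher mpg figure gives a lower L/100 km figure.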

In [124]:
# Your Code here
cars_data['l/100km'] = 235.215 / cars_data['mpg']
cars_data['kw'] = cars_data['hp'] / 1.341
cars_data['kg'] = cars_data['wt'] / 2.205
cars_data

Unnamed: 0,type,mpg,cyl,disp,hp,wt,speed,l/100km,kw,kg
0,AMC Ambassador Brougham,13.0,8,360.0,175.0,3821,11.0,18.093462,130.499627,1732.879819
1,AMC Ambassador DPL,15.0,8,390.0,190.0,3850,8.5,15.681000,141.685309,1746.031746
2,AMC Ambassador SST,17.0,8,304.0,150.0,3672,11.5,13.836176,111.856823,1665.306122
3,AMC Concord DL 6,20.2,6,232.0,90.0,3265,18.2,11.644307,67.114094,1480.725624
4,AMC Concord DL,18.1,6,258.0,120.0,3410,15.1,12.995304,89.485459,1546.485261
...,...,...,...,...,...,...,...,...,...,...
401,Volvo 145E (Wagon),18.0,4,121.0,112.0,2933,14.5,13.067500,83.519761,1330.158730
402,Volvo 244DL,22.0,4,121.0,98.0,2945,14.5,10.691591,73.079791,1335.600907
403,Volvo 245,20.0,4,130.0,102.0,3150,15.7,11.760750,76.062640,1428.571429
404,Volvo 264GL,17.0,6,163.0,125.0,3140,13.6,13.836176,93.214019,1424.036281


### Question 3.2: Chaotic Cars
Now that we have the data in a more internationally accepted fashion, you are tasked with finding the best car for your friend Ben. Ben doesn't know much about cars but cares for the environment, so there will be considerations for the fuel economy of the car. Your task is as follows:
- Find the top 15 cars with the lowest l/100km rating. These cars used the least amount of fuel when travelling.
- While Ben wants to be fuel efficient, he also wants to be able to race his friend Afzal. From the top 15 cars, find the top 10 that have the highest Power-to-Weight ratio. This is calculated by: `kw / kg`
- With the choice narrowed down to 10, make an argument for which 3 cars would be best for Ben and why. Consider that he wants to race Afzal so speed is important, and the number of cylinders might also impact how the car may behave (more cylinders generally means larger displacement because more petrol can get into the engine at any one time).

Provide your code below for the first two tasks, and provide text and code responses (evidenced by the data) for which cars are best for Ben.
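
The two filtering steps can also be sketched with `DataFrame.nsmallest` and `DataFrame.nlargest`, shown here on a made-up five-car table with smaller cut-offs (3 and 2 instead of 15 and 10). Dropping rows with a missing ratio before ranking is a design choice in this sketch, not something the task requires.

```python
import numpy as np
import pandas as pd

# Hypothetical cars; car B has no recorded power figure
cars = pd.DataFrame({
    "type": ["A", "B", "C", "D", "E"],
    "l/100km": [6.0, 5.0, 9.0, 5.5, 7.0],
    "kw": [45.0, np.nan, 90.0, 50.0, 60.0],
    "kg": [900.0, 830.0, 1500.0, 800.0, 1000.0],
})
cars["power_to_weight"] = cars["kw"] / cars["kg"]

# Step 1: the 3 cars that use the least fuel
efficient = cars.nsmallest(3, "l/100km")

# Step 2: of those, the 2 with the highest power-to-weight ratio,
# ignoring cars whose ratio could not be computed
ranked = efficient.dropna(subset=["power_to_weight"]).nlargest(2, "power_to_weight")

print(efficient[["type", "l/100km"]])
print(ranked[["type", "power_to_weight"]])
```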

In [153]:
# Your Code Here
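# Compute the power-to-weight ratio first so the column exists in every subset
cars_data['power_to_weight'] = cars_data['kw'] / cars_data['kg']
# The 15 most fuel-efficient cars (lowest l/100km)
top_15 = cars_data.sort_values(by=['l/100km']).head(15)
# Of those, the 10 with the highest ratio (ascending sort, so take the tail).
# Note: sort_values places NaN last, so a car with a missing hp value can land in this tail.
top_10 = top_15.sort_values(by=['power_to_weight']).tail(10)
top_10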

Unnamed: 0,type,mpg,cyl,disp,hp,wt,speed,l/100km,kw,kg,power_to_weight
357,Toyota Corolla Tercel,38.1,4,89.0,60.0,1968,18.8,6.173622,44.742729,892.517007,0.050131
247,Mazda GLC,46.6,4,86.0,65.0,2110,17.9,5.047532,48.47129,956.9161,0.050654
117,Datsun 210,40.8,4,85.0,65.0,2110,19.2,5.765074,48.47129,956.9161,0.050654
373,Toyota Starlet,39.1,4,79.0,58.0,1755,16.9,6.015729,43.251305,795.918367,0.054341
130,Datsun B210 GX,39.4,4,85.0,70.0,2070,18.6,5.969924,52.199851,938.77551,0.055604
237,Honda Civic,38.0,4,91.0,67.0,1965,15.0,6.189868,49.962714,891.156463,0.056065
290,Plymouth Champ,39.0,4,86.0,64.0,1875,16.4,6.031154,47.725578,850.340136,0.056125
395,Volkswagen Rabbit,41.5,4,98.0,76.0,2144,14.7,5.667831,56.674124,972.335601,0.058287
232,Honda Civic 1500 GL,44.6,4,91.0,67.0,1850,13.8,5.273879,49.962714,839.002268,0.05955
340,Renault Lecar Deluxe,40.9,4,85.0,,1835,17.3,5.750978,,832.199546,


In [None]:
# Extra Code Cell as necessary

So what do you think of your choice? Check with those around you in the lab (or your friends if you are doing this at home) to see what cars others have chosen, and why. In groups, come up with the best car and state your case to your tutor. After that, you're done!

*****

## Homework & Extension Questions
No formal homework is set for this week.

*****
## Resources
- [Numpy Manual v1.17](https://docs.scipy.org/doc/numpy/)
- [Pandas Docs v0.25.3](https://pandas.pydata.org/pandas-docs/stable/)