# Introduction to Python For Data Science - Session 2

# City Data Science Society 2019

Before we start this session, make sure that you have uploaded the CSV files to the files section on the left-hand side of the Jupyter Notebook.

While working through the notebook, make sure you run all the code cells included in each question.

## Section A - Python Recap

### 1. Methods

Write a function that prints out all the numbers between 1 and 100 that are divisible by 2 or 5.

### 2. Lists

#### Split

Sometimes we may want to split up a string into smaller parts so that we can perform operations on it more easily.

Python strings have a method for this called .split(). Its input parameter is the separator to split on, which can be a space (" ") or any other character such as ",".

If you use .split() without an input parameter, it will default to splitting your string on whitespace (spaces, tabs and new lines).

In [None]:
string = "The quick brown fox"

print(string.split())

In [None]:
string = "The,quick,brown,fox"

print(string.split(","))

The benefit of this operation is that you can now use a for loop over the resulting list to perform operations on each word.
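As a small sketch of that idea, the loop below visits each word produced by .split() and records its length. The variable names sentence and wordLengths are illustrative and are not used elsewhere in this notebook.

```python
# Illustrative sketch: loop over the words produced by .split()
sentence = "The quick brown fox"

wordLengths = []

for word in sentence.split():
    # len() gives the number of characters in each word
    wordLengths.append(len(word))
    print(word, len(word))

print(wordLengths)
```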

#### Task

Write a function that asks for an input of integers (whole numbers) separated by commas, and turns this input into a list.

For example:

input:

1,2,65,2,114,5

output:

[1, 2, 65, 2, 114, 5]

### 3. Data Cleaning

In Python you can change a variable's data type relatively easily.

For example, to convert a number to a string you would use str(34). To convert a string to a whole number you would use int("7").

If you need a number with decimal places, you can use a float. This is done with float("4.3222").
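The conversions described above can be tried out directly. This is a minimal sketch; the variable names are illustrative only.

```python
# Illustrative sketch of converting between data types
numberAsString = str(34)          # whole number -> string
stringAsInt = int("7")            # string -> whole number
stringAsFloat = float("4.3222")   # string -> number with decimal places

print(numberAsString, type(numberAsString))
print(stringAsInt, type(stringAsInt))
print(stringAsFloat, type(stringAsFloat))

# int() cannot read text containing a decimal point directly,
# so convert to a float first if the text might contain decimals
wholePart = int(float("4.3222"))
print(wholePart)
```

Note that int() drops everything after the decimal point rather than rounding.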

##### Hint: Remember that you can iterate through a list of values by using the loop below.

In [None]:
listOfWords = ["Glass", "Table", "Caramel"]

for index in range(len(listOfWords)):
    
    word = listOfWords[index]
    
    print(word)

#### Task

Given the list below, create a function that iterates through the list and outputs a version of the list in which all values have the same data type.

In [None]:
dirtyData = [1, 3.6, 7.555555555, 10, 0.00001, "35"]

### 4. Dictionaries

Dictionaries are very useful tools for storing data, and are used in Pandas to create DataFrames.

The key of the dictionary represents the column name.

The corresponding values in the dictionary represent the data.

In a dictionary this relationship is visualised as follows:

{"columnName" : data}

#### Creating Dictionaries Method 1

The column names in our dictionaries can simply be strings; however, the data for each column needs to be in a list.

In [None]:
myDictionary = {"Height": [30,40,50,20,45], "Width": [20,43.5,343,44.5,50], "Length": [40,55.3,77,20,30]}

Notice that dictionaries are created with curly brackets {}, and lists are created with square brackets [].

We need to ensure that all of the lists in our dictionary are of equal length, because Pandas needs every column to have the same number of rows when it builds a DataFrame. That means that in this case, the lists of values for "Height", "Width" etc. need to be the same length.
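One way to check this requirement before going any further is sketched below. The names measurements, columnLengths and allSameLength are illustrative assumptions, and the data simply repeats the example above.

```python
# Illustrative check that every column in a dictionary has the same length
measurements = {
    "Height": [30, 40, 50, 20, 45],
    "Width": [20, 43.5, 343, 44.5, 50],
    "Length": [40, 55.3, 77, 20, 30],
}

# Record how many values each column holds
columnLengths = {}

for name in measurements:
    columnLengths[name] = len(measurements[name])

print(columnLengths)

# A set keeps only distinct values, so one element means all lengths match
allSameLength = len(set(columnLengths.values())) == 1
print(allSameLength)
```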

#### Task

Think of a use case for storing data. This could be storing information about people and their height or age, or it could be about recording the prices of cakes.

Create a dictionary using the method shown above to store this data and print the resulting dictionary.

### 5. Advanced Dictionaries

We can combine lists together to automate the dictionary creation process.

#### Creating Dictionaries Method 2


The zip() function below pairs each column name with its list of data, e.g. "Name" with ["Ben", "Jacob", "Nikolaos"] and "Age" with [43, 54, 75].

The dict() function converts the zipped lists into a dictionary.

In [None]:
columns = ["Name", "Age"]

data = [["Ben", "Jacob", "Nikolaos"], [43, 54, 75]]

myDictionary = dict(zip(columns, data))

myDictionary
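To see what zip() produces before dict() converts it, the pairs can be viewed with list(). This is a sketch reusing the same example data; the names pairs and pairedDictionary are illustrative.

```python
# Illustrative sketch: inspect what zip() produces before dict() converts it
columns = ["Name", "Age"]
data = [["Ben", "Jacob", "Nikolaos"], [43, 54, 75]]

# list() lets us view the pairs that zip() creates
pairs = list(zip(columns, data))
print(pairs)

# dict() uses the first item of each pair as the key and the second as the value
pairedDictionary = dict(pairs)
print(pairedDictionary["Age"])
```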

#### Task

Create and display a dictionary that has two columns: "Fruit" and "Price", and contains rows of fruit and their corresponding prices.

## Section B - Pandas

### 1. DataFrames

A DataFrame is Pandas terminology for a table of data which contains columns and rows of data and can be edited fluidly in Python.

#### Importing

The Pandas library must be imported before it can be used.

In Python we can give a library a shorter alias when we import it, so that it is easier to work with later on.

In [None]:
import pandas as pd

We are naming this library 'pd' so that we can reference it quickly in our code.

#### Initialising DataFrames

As seen below, we can simply pass our dictionary from before into pd.DataFrame(), and it will create a DataFrame for us.

In [None]:
data = {"Name": ["Ben", "Jacob", "Nikolaos"], "Age": [54, 67, 23]}

myDataFrame = pd.DataFrame(data)

myDataFrame

#### Task

1. Initialise a dictionary with some columns and some data. (Or use the dictionary you created in the previous task)

2. Create a DataFrame from this dictionary and print it.

### 2. Accessing Rows And Columns

#### Rows

We have a few different tools at our disposal to access rows of data within our DataFrame.

The first of these is .loc[[]] (e.g. myDataFrame.loc[["Jacob"]]), which returns the row with the index label that we specify.

First though, we need to change the index of the DataFrame so that we can reference the Name column directly.

In [None]:
myDataFrame = myDataFrame.set_index("Name")

myDataFrame

We can now use the .loc[[]] method to access just the "Jacob" row.

In [None]:
myDataFrame.loc[["Jacob"]]

Alternatively, we can use .iloc[[]] to retrieve the same row using its position in the DataFrame (counting from 0) rather than its index label.

In [None]:
myDataFrame.iloc[[1]]
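The sketch below compares the two approaches on a small, made-up DataFrame (the pets data is hypothetical and not part of this session's files). It also shows that single square brackets return the row as a Series, whereas the double brackets used above return a DataFrame.

```python
import pandas as pd

# Hypothetical example data, indexed by name
pets = pd.DataFrame({"Name": ["Rex", "Tom", "Nemo"], "Legs": [4, 4, 0]})
pets = pets.set_index("Name")

# Double brackets return a DataFrame containing the selected row
rowAsDataFrame = pets.loc[["Tom"]]

# Single brackets return the same row as a Series
rowAsSeries = pets.loc["Tom"]

# .iloc selects by position, counting from 0, so position 1 is also "Tom"
rowByPosition = pets.iloc[[1]]

print(type(rowAsDataFrame))
print(type(rowAsSeries))
print(rowAsDataFrame.equals(rowByPosition))
```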

#### Columns

Now that we can access rows, we can start to look at accessing columns.

To access one column, we can simply write the name of the column within square brackets next to the DataFrame.

In [None]:
myDataFrame['Age']

This gives us a 'Series', which is Pandas terminology for a list of values with an index. We can convert this Series to a list using the list() wrapper.

In [None]:
list(myDataFrame['Age'])

Now that we have this data in a list we can iterate through it and perform operations like we have seen earlier in the course.

#### Task

Write a line of code that will retrieve the row with the index of "Ben" from myDataFrame using either the loc or iloc method.

### 3. Importing External Datasets

One of the most popular data formats is the .csv file, which stands for comma-separated values.

The Pandas library makes it very easy to import datasets, since we can use the pd.read_csv() function.

The dataset we will be using today is imaginatively called 'data.csv' and must be uploaded to the Jupyter Notebook before it can be imported. This is an arbitrary dataset comparing the value of X to Y.

Drag and drop the dataset to the left-hand side of your screen to add it to the list of files available.

In [None]:
df = pd.read_csv("data.csv")

We've called this DataFrame df, but you can change this to whatever you like.

We can view the contents of the dataset we just uploaded by simply writing the name of our DataFrame in a cell.

In [None]:
df

This will display the entirety of our dataset; however, it's sometimes more useful to just view the first few rows to check that everything's formatted correctly. By default, the .head() method will return the first 5 rows of the DataFrame, but we can change that by passing in a different number.

In [None]:
df.head(10)

Let's find out some information about our dataset.

We can run the .info() method to see the size of the dataset and information about its values.

In [None]:
df.info()

Usefully, this tells us that the dataset has 63 rows and two columns. It also tells us that all the values are integers, so we're dealing with whole numbers.

##### But what more is there to see?

We can perform .describe() on the dataset to view statistics about our values.

In [None]:
df.describe()

It's always useful when importing a new dataset for the first time to run these methods and get a picture of the type of data you are going to be dealing with.
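If you would like to try these methods without a file, the sketch below reads a tiny CSV from a string instead. The CSV text and the names csvText and sample are made up for illustration and do not reflect the contents of data.csv.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csvText = "X,Y\n1,2\n2,4\n3,6\n4,8\n"

# read_csv accepts file-like objects as well as file names
sample = pd.read_csv(io.StringIO(csvText))

print(sample.head(2))      # first two rows only
sample.info()              # row count, column names and data types
print(sample.describe())   # count, mean, std, min, quartiles and max

print(sample.shape)        # (number of rows, number of columns)
```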

#### Task 

Import the dataset "olympics.csv", and view the contents. Run the statistical and info methods on this dataset to investigate its contents.

Make sure to name this DataFrame something different from the dataset we have been working on so far.

## Section C - Matplotlib

Matplotlib is a popular graphing library within Python. Today we will be looking at a tool within Matplotlib called PyPlot.

Note: For Section C we will also be referencing the data.csv dataset.

In [None]:
import matplotlib.pyplot as plt

We can use this library to graph our DataFrames with the example below.

In [None]:
plt.scatter(df['X'], df['Y'])
plt.show()

#### Task

Try plotting the data in the olympics.csv dataset using the method described above.

Have a look at the documentation for PyPlot online and add labels for the axes, and a title for the graph.

## Extension Task

Using the olympics.csv DataFrame just imported, write a function that will print every third row, but only if the country name starts with an 'A' or 'G'.

##### Hint: You will need to look at using either .iloc or .loc for this.