#  Introduction to Python (session 1) <a class="tocSkip">

Content adapted from Francesca Pontin's Introduction to Python Course (please contact: F.L.Pontin@leeds.ac.uk) <a class="tocSkip">

# What is Python? Getting started
In this session we will cover:
- User interfaces
- Packages, modules etc.
- Data types

## Using Jupyter notebooks

Welcome to coding in Python! This course is designed to introduce you to the basics of coding in Python and then get you to apply your new found coding skills to carry out some data analytics! 
This session is written in a Jupyter Notebook. A Jupyter Notebook ["is an open source web application that you can use to create and share documents that contain live code, equations, visualizations, and text"](https://realpython.com/jupyter-notebook-introduction/). 

A Jupyter Notebook is made up of cells (the grey boxes). These cells can be text (known as markdown), code or visualisations/images, to name a few. 
To select a cell, click anywhere in the box and a large blue box will appear around it. To edit a cell, click in it twice and the surrounding box will turn green, <br>
as shown below.


![image.png](attachment:image.png)

![image.png](attachment:image.png)

During this session you will read through the Jupyter Notebook, running sections of code yourself to learn key data analysis skills. There will also be places to edit and write your own code within the notebook.

#### <font color='orchid'>Instructions and tasks for you to complete are in purple</font> <a class="tocSkip">
    
#### Where you have to write your own code, answers are provided at the end of the section <a class="tocSkip">

### How to run a cell of code
To run code, select the cell you want to run (the blue box will appear around the selected cell) and then EITHER:

1) press CTRL + ENTER on your keyboard (SHIFT + ENTER also runs the cell and then moves on to the next one)

OR

2) click "Run" in the Jupyter Notebook menu above 
<img src="Intro_to_Python\running_code.png" width=500>





### How do I know if I have run the cell?
If the cell of code has been run, a number will appear in square brackets by the cell, e.g. <code>In [1]:</code> or <code>In [12]:</code>

The numbering refers to the order in which you have run the cells

An un-run cell of code has an empty set of square brackets by the cell, i.e. <code>In []:</code>

### How do I add a new cell?<a id='new_cell'></a>

To add a new cell, select a cell (so it is surrounded by a blue box) and type 'b' to add a cell below or 'a' to add a cell above.

To delete a cell select it and double tap 'd'.

## Hello World

Following coding tradition, the first thing we are going to program is to print the words "Hello World".

The <code>print()</code> function prints the specified message to your screen.

The speech marks <code>" "</code> around Hello World let Python know you are typing text (also known as a string).

 <font color='orchid'> <b>Run the code below</b></font>

In [None]:
print("Hello World")

## Basic maths

Let's do some basic maths 
<br><font color='orchid'> <b>Run the code below</b></font>

In [None]:
2+2

If we print 2+2 we get the same thing
<br> <font color='orchid'> <b>Run the code below</b></font>

In [None]:
print(2+2)

If we <code>print("2+2")</code>, however, we get the characters 2+2, as the speech marks tell Python we are typing text and not numbers.


<br> <font color='orchid'> <b>Run the code below</b></font> and see for yourself

In [None]:
print("2+2")
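The difference between a number and text can be checked with Python's built-in <code>type()</code> function. This is a minimal sketch; the variable names <code>as_number</code> and <code>as_text</code> are made up for this illustration.

```python
# without speech marks 2 + 2 is arithmetic; with speech marks it is text
as_number = 2 + 2
as_text = "2+2"

print(type(as_number))  # an integer (int)
print(type(as_text))    # a string (str)

# int() turns text made up of digits back into a number
print(int("2") + int("2"))
```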


## Assigning values to variables

We can also assign the sum 2+2 to a variable. 

In this case we have named the variable <font color='blue'>answer</font>. We can now type <font color='blue'>answer</font> instead of 2+2 every time.

<br> <font color='orchid'> <b>To assign 2+2 to the variable answer run the code below</b></font>

In [None]:
answer = 2+2
print(answer)

We can also add our created variable to a new equation
<br> <font color='orchid'> <b>To add 5 to the variable answer run the code below</b></font>

In [None]:
# we can also add our created variable to a new equation
answer + 5

NOTE: <code>#</code> When a hash symbol is added to a line of code, the rest of that line is not treated as code.
This can be used to add comments to your code so you know what it is doing (especially useful when you come back to it later on!)
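One detail worth seeing in code: using a variable in a sum does not change the variable unless the result is assigned back to it. A small sketch, using a hypothetical variable called <code>total</code>:

```python
# assign the result of 2 + 2 to a variable
total = 2 + 2

# using the variable in a sum does not change it...
total + 5
print(total)

# ...unless the result is assigned back to the same name
total = total + 5
print(total)
```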

## Lists

Remember, lists are an ordered collection of data items, defined by square brackets <code>[]</code>.
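As a reminder of how lists behave, here is a short sketch using a made-up list called <code>colours</code> (not the list you are asked to create below):

```python
# a list of three strings, defined with square brackets
colours = ['red', 'green', 'blue']

print(type(colours))  # shows this is a list
print(len(colours))   # number of items in the list
print(colours[0])     # first item (Python counts from 0)
print(colours[-1])    # last item

# lists can be changed after they are created
colours.append('yellow')
print(colours)
```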

<br> <font color='orchid'> <b>Look back at the lecture notes and create a list called fruit with the following data items: 'apple', 'pear', 'banana', 'strawberry' (note these are strings)</b></font>

In [None]:
# Type code here

We can check that you have created a list using the <code>type</code> function

<br> <font color='orchid'> <b>Enter <code> type(fruit) </code> into the cell below</b></font>

In [None]:
# run this code


### Lists and basic maths

We can also create a list of numbers.  <br> <font color='orchid'> <b>Run the code below </b></font>. What happens when we try to multiply the list by 3?

In [None]:
numbers = [1,3,5,7,9,11]
numbers*3

### List comprehension

To multiply each item in the list (and create a new list of the results) we need to select each value <code>i</code> in the list:
<code>[i*3 for i in numbers]</code>

I.e. for each element <code>i</code> in the list named numbers, multiply that element by 3

<br> <font color='orchid'> <b> Run the code to see</b></font>

In [None]:
# multiply element by 3, repeat for every element in the list 'numbers'
new_numbers = [i*3 for i in numbers]

# print the new list
print(new_numbers)

<div class="alert alert-block alert-warning">

### A very quick introduction to for loops

Making a list based on an old list is known as list comprehension. An alternative is a for loop as shown below.

We will come back to these later, so for now just <font color='orchid'> <b> run the code and read the comments explaining what each step does. </b></font>

In [None]:
# create a new empty list
my_new_list = []

# for element in the list 'numbers': 
for i in numbers:
    # multiply element by 3 and append the result to the new empty list        
    my_new_list.append(i * 3)

# print the new list
print(my_new_list)
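To see that the two approaches give the same result, here is a small sketch on a made-up list called <code>values</code>. The last line also shows that a list comprehension can include a condition, which is an extra not covered above.

```python
values = [2, 4, 6, 8]  # hypothetical example list

# list comprehension: one line
doubled_comprehension = [v * 2 for v in values]

# for loop: the same steps written out
doubled_loop = []
for v in values:
    doubled_loop.append(v * 2)

# both approaches build the same list
print(doubled_comprehension == doubled_loop)

# a comprehension can also keep only some of the elements
large_only = [v for v in values if v > 4]
print(large_only)
```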

We can also multiply by another defined variable. For example we previously defined the variable <code>answer</code> (<code>answer = 2+2</code>)

In [None]:
# multiply element by variable 'answer', repeat for every element in the list 'numbers'
[i*answer for i in numbers]

Commonly you will see <code>i</code> and <code>j</code> to define elements in a list. However this is just convention and you could use anything you wanted to refer to the elements in the list e.g. <code>elephants</code>

(Though typing <code>i</code> is a lot quicker)

In [None]:
# multiply element by 3, repeat for every element in the list 'numbers'
[elephants*3 for elephants in numbers]

## First look at data frames

### .head() and .tail() functions
To understand how to explore data frames and different data types in Python we are going to use a set of data about passengers on the Titanic. This is an example dataset available through the seaborn Python package. 

We will go into detail about reading in data and loading packages in the next exercise, <font color='orchid'> <b>for now run the cell of code below. </b></font>

Note the <code>.head()</code> function shows the top 5 lines of the data frame. 

In [None]:
# Import the seaborn package
import seaborn as sns

# load the titanic example dataset and save it as a dataframe named titanic
titanic = sns.load_dataset('titanic')

# look at the first 5 rows of the dataframe
titanic.head()

Note NaN denotes a cell containing no data - a null cell
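The same ideas can be tried on a tiny table. This sketch builds a hypothetical dataframe called <code>pets</code> by hand (the names and values are invented) and shows that <code>.head()</code> accepts the number of rows to display, and that a missing number appears as NaN.

```python
import pandas as pd

# hypothetical mini table; None marks a missing value
pets = pd.DataFrame({
    'name': ['Rex', 'Tom', 'Bubbles', 'Polly'],
    'age': [3.0, None, 1.5, 7.0],
})

# .head() takes an optional number of rows (the default is 5)
first_two = pets.head(2)
print(first_two)  # the missing age is displayed as NaN
```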

<font color='orchid'> <b> Try entering and running <code>titanic.tail()</code> instead of <code>.head()</code> in the cell below. </b></font> What view of the dataframe do you think you are now seeing? 

In [None]:
# enter the instructed code here


<div class="alert alert-block alert-info">
    
<I> A quick description of the titanic data variables (note: PassengerId, ticket and cabin come from the original Titanic data and are not columns in the seaborn version):
- <b>survived:</b>    If the passenger survived
- <b>PassengerId:</b> Unique Id of a passenger. 
- <b>pclass:</b>    Ticket class
- <b>sex:</b>   Sex     
- <b>age:</b>   Age in years     
- <b>sibsp:</b>    Number of siblings / spouses aboard the Titanic     
- <b>parch:</b>   Number of parents / children aboard the Titanic     
- <b>ticket:</b>   Ticket number     
- <b>fare:</b>   Cost of the passenger fare     
- <b>cabin:</b>  Cabin number     
- <b>embarked:</b>    Port of Embarkation</I> 

### Data frame columns

We might just want to get a list of the columns in the dataframe to give us a quick idea of what data we have present. To do this we can use the <code>.columns</code> attribute after we name the dataframe.

<font color='orchid'> <b> Enter <code>titanic.columns</code> in the cell below. </b></font> The columns listed should be the same as the columns in the <code>.head()</code> view of the dataframe. 


In [None]:
# enter the code here


#### Referring to a column in a dataframe

If we want to select a single column of the dataframe we can also do that.

There are several ways to refer to a column in a pandas dataframe. 

The easiest way is by putting the name of the column in square brackets and speech marks <code>[" "]</code> after the name of the dataframe,
e.g. <code><font color='blue'>dataframe_name</font>["<font color='blue'>column_name</font>"]</code>
    
<font color='orchid'> <b> Try to select the 'fare' column from the titanic dataframe. </b></font>

In [None]:
# enter the code here


*Note: To save space only a snapshot of the column is shown and not all the rows

### Data frame index
When we type <code> dataframe_name["column_name"]</code> we get the values of the column but we also see the index (the row names). In this case the rows are just numbered 0:890. But these could be other values such as passenger names.

Python indexing starts at 0 not 1. So the first row is row 0 and the first column column 0. 

<code>.index</code> works the same as <code>.columns</code> but this time shows the row names. 

<font color='orchid'> <b> Run <code>titanic.index</code> in the cell below </b></font> 

In [None]:
# enter the code here


We can see the index starts at 0, stops at 891 (so the last row is numbered 890, giving 891 entries) and increases by a step of 1 for each row.

### Data frame shape
To get the number of rows and columns of a dataframe we can use <code>.shape</code>

<font color='orchid'> <b> Run <code>titanic.shape</code> in the cell below </b></font>

In [None]:
# enter the code here
titanic.shape
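To make the link between <code>.index</code>, <code>.columns</code> and <code>.shape</code> concrete, here is a sketch on a small hand-made dataframe (the name <code>scores</code> and its values are invented for illustration).

```python
import pandas as pd

# hypothetical table with 3 rows and 2 columns
scores = pd.DataFrame({
    'player': ['A', 'B', 'C'],
    'score': [10, 12, 9],
})

# .shape is a pair of numbers: (number of rows, number of columns)
n_rows, n_cols = scores.shape
print(n_rows, n_cols)

# the row labels start at 0, so 3 rows are labelled 0, 1 and 2
print(list(scores.index))
print(list(scores.columns))
```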

## Data Types

### Data types recap

As we covered in the lecture there are different types of data: 

<b>Objects:</b> also known as strings, i.e. written characters/text in plain English e.g. Hello World

<b>Integers:</b> Whole numbers e.g. 2, 57 or 109567835

<b>Floats:</b> A number with a decimal place e.g. 2.34534, 5.5 or 1.0

<b>Boolean:</b> True or False data type
<br>

<b>Datetime:</b> Values that are either a date, time or both e.g. 2019-10-31 09:26:03.478039 (9:26 am on Halloween 2019)

<b>Category:</b> A finite list of text values e.g. London, Paris, Berlin, Rome (there are a finite number of capital cities)

Learn more about python data types using this realpython [online resource](https://realpython.com/python-data-types/)
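The types above can be checked on single values with the built-in <code>type()</code> function. This is a sketch with invented example values. Note that plain Python calls text <code>str</code>, and that "category" is a pandas data type rather than a built-in Python one, so it is not shown here.

```python
from datetime import datetime

# one example value for each of the built-in types described above
examples = [
    "Hello World",                       # text
    109567835,                           # whole number
    2.34534,                             # number with a decimal place
    True,                                # True or False
    datetime(2019, 10, 31, 9, 26, 3),    # date and time
]

# print each value next to the name of its type
for value in examples:
    print(repr(value), '->', type(value).__name__)
```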
<br>
<br>
<br>

### Checking the data type

We can check the data type of each column in a dataframe using the <code>.dtypes</code> attribute 
<br> <font color='orchid'> <b>Run the code below</b></font> and have a look at the data type of each of the columns. Are they all as you expected?

In [None]:
titanic.dtypes

<code>.info()</code> gives us slightly more information including: 
- data types: (<code>.dtypes</code>)
- non-null counts: number of rows containing non-null values
- memory usage: how much computer memory the table uses (useful to know, as a very large table can slow down or crash your code)

If a column has fewer non-null values than the total number of rows this indicates that data might be missing. 
<br> <font color='orchid'> <b>Run the code below</b></font>

In [None]:
titanic.info()

<div class="alert alert-block alert-info">

## Check your answers before moving on
    

Answers to the "enter your own code" sections above. 

<b>1.5 Lists <font color='orchid'> Create a list called fruit with the following data items: 'apple', 'pear', 'banana', 'strawberry' (note these are strings)<font color='orchid'></b>
 
 ![fruit.png](attachment:fruit.png)   

    
<b> 1.5.1 Lists and basic maths<font color='orchid'> What happens when we try to multiply the list by 3? <font color='orchid'> </b> The list repeats 3 times:
![number_list.png](attachment:number_list.png)
    
    
<b> 1.6.2 Data frame columns <font color='orchid'> Enter titanic.columns in the cell below. <font color='orchid'> </b>
![titanic_columns.png](attachment:titanic_columns.png)
    
    
<b> 1.6.2.1  Referring to a column in a dataframe<font color='orchid'> Select the 'fare' column from the titanic dataframe.<font color='orchid'> </b>
    
![titanic_fare.png](attachment:titanic_fare.png)
    

<div class="alert alert-block alert-info">

# Reading in, exploring and summarising data

- Importing packages
- Reading in data (CSV)
- Exploring data frames
- Data summarisation
- Data cleaning

## Importing Packages

Python has a lot of basic functionality built in e.g. the maths we have just done, but a lot of the time while doing data analysis you will require more than the basic functionality. This is when you need to import packages.
If you have seen anyone else's code before, you will have noticed that people tend to import all packages at the beginning of their code.

To import a package use the statement <code>import</code> followed by the <code>package_name</code>.

<br> <font color='orchid'> <b>Run the code below</b></font> to import the pandas package. <br>(nothing will appear below the cell but a number will appear in the square brackets)



In [None]:
import pandas

### The pandas package
The pandas package allows us to easily create and handle dataframes, similar to Excel. Data is put into an easily readable format of columns and rows.

It also allows us to read in data from a CSV file or Excel file.

When we use a function from the package we refer to the package by name, for example <code>pandas.read_csv()</code>

### Note on abbreviating packages
To save us having to type out pandas every time, we can abbreviate pandas to pd using the code below (<font color='orchid'> <b>Run the code below</b></font>). 

This is often done for commonly used packages to save time and reduce the likelihood of typos while coding.

So <code>pandas.read_csv()</code> becomes <code>pd.read_csv()</code>.

In [None]:
import pandas as pd
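The same abbreviation pattern works for any package. As a sketch, here it is with <code>statistics</code>, a small package that comes with Python (the abbreviation <code>st</code> is just a name chosen for this example):

```python
import statistics         # full name
import statistics as st   # the same package under a shorter name

# both names give access to the same functions
print(statistics.mean([1, 2, 3, 4]))
print(st.mean([1, 2, 3, 4]))

# the abbreviation is just another name for the same package
print(st is statistics)
```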

 Now let's import some of the packages we will need for the rest of the exercises
 
<font color='orchid'> <b>Run the code below</b></font>

In [None]:
# Import other required packages

# packages for visualising data
import matplotlib.pyplot as plt
import seaborn as sns

## Reading in Data

#### About the data <a class="tocSkip">   

I have fabricated some data about buying coffee (my favourite thing to do) and coffee prices close to the University. We will use this data in parts of the workshop later on. Before we get started with any analysis we need to import the data.

##### Data format <a class="tocSkip">   

The data is originally in a CSV (comma-separated values) format. Most data in Excel can be saved as a CSV and commonly downloaded data comes as a CSV file.
    
Example of CSV file:

<img src="screenshots/csv_screenshot.png" width="600" />

### Reading in CSV files

To read the data into a dataframe we will use the pandas package introduced above, using the <code>pd.read_csv()</code> function.


The <code>pd.read_csv()</code> function requires a 'filepath', i.e. we need to tell Python where to find the data. To do this we put the filepath in speech marks within the brackets of the function.
I.e. <code>pd.read_csv("<font color='blue'>file_path</font>")</code>

There is a <a href='#file_paths'>quick note about file paths</a> below if these are new to you.

<br> <font color='orchid'> <b>Run the code below</b></font>

In [None]:
coffee = pd.read_csv("Intro_to_Python/coffee_data.csv")
# view first 5 rows of the coffee dataframe
coffee.head()

##### So what have we done in the above code?

We have told pandas to read in a CSV file and provided the path to where the CSV file can be found (saved in a folder within my documents).

We have also told pandas to assign the dataframe the name 'coffee'.
It is useful to have logical dataframe names rather than df1, df2, etc., so you know what you are referring to when you come back to your code.
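If you do not have a CSV file to hand, the same function can be tried on CSV text typed directly into the notebook. This is a sketch only: the table contents and the name <code>drinks</code> are invented, and <code>io.StringIO</code> is used to make a piece of text behave like a file.

```python
import io
import pandas as pd

# hypothetical CSV content: a header row, then one row per line
csv_text = """shop,price,rating
Cafe A,2.50,4
Cafe B,3.10,5
Cafe C,2.80,3
"""

# io.StringIO lets pandas read the text as if it were a file
drinks = pd.read_csv(io.StringIO(csv_text))

print(drinks.head())
print(drinks.dtypes)  # pandas works out a data type for each column
```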

## Summary statistics

We are going to go back to the titanic data now to look at some summary statistics. 

To get an overview of all the data in a data frame we can explore the different columns (variables) and get some basic summary statistics by using the function <code>.describe()</code>
<br>By default this only summarises columns with a numeric (integer or float) data type.
<br> <font color='orchid'> <b>Run the code below, making sure you understand the output</b></font>

In [None]:
titanic.describe()

### Summarising a single column
<code>.describe()</code> can also be used on just one column:

<b>Remember:</b> The easiest way to refer to a column is by putting the name of the column in square brackets and speech marks [" "] after the name of the dataframe. e.g. <code>dataframe_name["column_name"].describe()</code>

<font color='orchid'> <b> Write code to describe the "age" column of the titanic dataframe </b></font>

In [None]:
titanic['age'].describe()

or a subset of columns:

<br> <font color='orchid'> <b> Run the code below to describe the "age" and "survived" columns of the titanic dataframe </b></font>

In [None]:
titanic[['age','survived']].describe()

Note: to select more than one column we use double square brackets [[ ]], as we are selecting a <b>list</b> of columns.
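The difference between single and double square brackets shows up in what you get back. A sketch on a made-up dataframe called <code>people</code>:

```python
import pandas as pd

# hypothetical example table
people = pd.DataFrame({
    'age': [22, 35, 58],
    'height': [1.70, 1.82, 1.65],
    'city': ['Leeds', 'York', 'Hull'],
})

# single brackets with one name: a single column (a pandas Series)
one_column = people['age']

# double brackets with a list of names: a smaller dataframe
two_columns = people[['age', 'height']]

print(type(one_column).__name__)
print(type(two_columns).__name__)
```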

### Mean, median, mode, quantile, std, count
We can also just get one of the metrics from the <code>describe()</code> function e.g. <code>.mean()</code>, <code>.quantile(0.75)</code>, <code>.max()</code>

Use the age column of the titanic data to explore these functions in the cells below.
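The cells below cover the mean, median, quantiles, maximum and minimum. For the mode, standard deviation and count, here is a brief sketch on a small invented set of ratings (the name <code>ratings</code> and the values are assumptions for illustration).

```python
import pandas as pd

# hypothetical ratings; None is a missing value
ratings = pd.Series([3, 4, 4, 5, None])

print(ratings.count())  # number of non-null values
print(ratings.mean())   # missing values are ignored
print(ratings.std())    # standard deviation
print(ratings.mode())   # most common value(s), returned as a Series
```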

#### Mean
<code>.mean()</code>
<br> <font color='orchid'> <b> Run the code below to calculate the mean "age" </b></font>

In [None]:
titanic['age'].mean()

#### Median
<code>.median()</code>
<br> <font color='orchid'> <b> Run the code below to calculate the median "age" </b></font>

In [None]:
titanic['age'].median()

#### Quantile
<code>.quantile(q)</code><br>
Where: <br>
q: Quantile or sequence of quantiles to compute, which must be between 0 and 1 inclusive.

<br> <font color='orchid'> <b> Run the code below to calculate quantiles for the "age" variable </b></font>

In [None]:
# Get the 75% quantile
titanic['age'].quantile(0.75)

<br> <font color='orchid'> <b>Write your own code below to get the 25% quantile</b></font>

In [None]:
# Get the 25% quantile


#### Maximum and Minimum

<code>.max()</code> <code>.min()</code>
<br> <font color='orchid'> <b> Run the code below to get the oldest passenger</b></font>

In [None]:
# get the oldest passenger
titanic['age'].max()

<br> <font color='orchid'> <b>Write your own code below to get the youngest passenger</b></font>

In [None]:
# get the youngest passenger


#### n smallest and largest

<code>.nsmallest(n=)</code> <code>.nlargest(n=)</code>

Sometimes it is useful to be able to look at the n values at the extremes of the data. This can be especially useful to identify outliers in the data. 

<br> <font color='orchid'> <b>Write your own code below to identify the 10 youngest and 5 oldest passengers</b></font>

In [None]:
# 10 youngest passengers 


In [None]:
# 5 oldest passengers 


## Data Cleaning

Data cleaning is the process of replacing, modifying, or deleting records from a set of data that have been identified as incomplete, incorrect, inaccurate or irrelevant.

### Checking for missing data

Often the first step in data cleaning is to check for missing data. Earlier on we used <code>.info()</code> to get an overview of the titanic data frame and identify the number of non-null values in each column. Remember <code>NaN</code> denotes a cell that contains no data (a null object).

<br> <font color='orchid'> <b>Re-run <code>titanic.info()</code> to remind yourself</b></font>

In [None]:
titanic.info()

Equally we can use <code>.isna()</code> or <code>.isnull()</code>, which both do the same thing: identify whether each cell in the dataframe contains no data. 

<br> <font color='orchid'> <b>Run both sets of code below to see which rows contain null data.</b></font>

In [None]:
titanic.isna()

In [None]:
titanic.isnull()

This shows us cell by cell whether the value is null or not; however, it would also be useful to know the total number of null values in each column. To do this simply add <code>.sum()</code> after <code>.isna()</code> or <code>.isnull()</code>.

In [None]:
titanic.isnull().sum()

From this we can see many passengers are missing age data, 2 are missing data on where they embarked, and most are missing data on which deck they were on.
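Counting works here because True is treated as 1 and False as 0 when summed. The sketch below shows this on a small invented dataframe called <code>survey</code>, and also uses <code>.mean()</code> in the same way to get the share of missing values in each column.

```python
import pandas as pd

# hypothetical table with some missing values (None)
survey = pd.DataFrame({
    'age': [25, None, 40, None],
    'town': ['Leeds', 'York', None, 'Hull'],
})

# number of missing values in each column
missing_counts = survey.isna().sum()
print(missing_counts)

# proportion of missing values in each column (between 0 and 1)
missing_share = survey.isna().mean()
print(missing_share)
```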

There are many ways to deal with missing data and it is up to you how you choose to handle missing values. For the sake of this exercise I will show you a few methods.

#### Drop unwanted columns

As so many passengers are missing data on which deck they are on we are going to remove or 'drop' the deck column from the titanic dataframe. 

To do this we use the <code>.drop()</code> function.
We have to specify:
- the label of the column (or index) we want to drop <code>['deck']</code> 
- the axis to be dropped: whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).

Full information about the <code>.drop</code> function can be found here https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html 

In [None]:
titanic.drop(['deck'], axis=1)

In [None]:
# or
titanic.drop(['deck'],axis='columns')

In [None]:
# or
titanic.drop(columns =['deck'])

However, if we now check the titanic columns, 'deck' is still there:

In [None]:
titanic.columns

This is because we have not replaced the original titanic dataframe with the new titanic dataframe where the column has been dropped. To do this we simply need to write <code> titanic = titanic.drop(columns =['deck'])</code>

In [None]:
titanic = titanic.drop(columns =['deck'])
titanic.columns

Now the titanic dataframe does not contain the 'deck' column. 

Be careful to check the code has done what you want it to when redefining dataframes, as you may accidentally lose data or information.

### Dropping unwanted rows

In the case of the passengers with missing embarking information we may want to remove these two passengers from the dataframe.

#### Dropping null rows

Similar to <code>.drop()</code> and <code>.isna()</code>, we can use <code>.dropna()</code>, which drops null values, e.g.

<code>titanic['embarked'].dropna()</code> returns the 'embarked' column without the rows where 'embarked' is NaN (the titanic dataframe itself is not changed)

In [None]:
titanic['embarked'].dropna()
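Calling <code>.dropna()</code> on a single column only returns that column. To drop whole rows from a dataframe based on one column, pandas provides the <code>subset</code> argument of <code>.dropna()</code>. A sketch on an invented dataframe called <code>trips</code>:

```python
import pandas as pd

# hypothetical table: passenger B has no port recorded
trips = pd.DataFrame({
    'passenger': ['A', 'B', 'C'],
    'port': ['Southampton', None, 'Cherbourg'],
})

# on one column: returns that column without its null values
port_only = trips['port'].dropna()

# on the dataframe: drops whole rows where 'port' is null
complete_trips = trips.dropna(subset=['port'])

# the original dataframe is unchanged unless we reassign it
print(len(trips), len(port_only), len(complete_trips))
```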

In [None]:
# count the passengers aged exactly 10 (.loc is explained in the next section)
titanic.loc[titanic['age']== 10].shape[0]

### Using .loc to access groups of rows and or columns

We can also look at the rows where 'embarked' is NaN.

The <code><font color='blue'>dataframe_name</font>.loc[]</code> attribute of pandas dataframes allows us to access rows and/or columns of the dataframe we want to edit or view  


<br><font color='orchid'><b> Run through the next few cells and make sure you understand how <code>.loc</code> is working to select the different columns and rows </b></font>

In [None]:
titanic.loc[titanic['embarked'].isna()]

We can also look at rows where the column contains a certain attribute or attributes. For example we can see which passengers embarked in Southampton using <code>titanic.loc[titanic['embark_town']=='Southampton']</code>

<b>Note:</b> 
- the double equals sign == is used to compare the values (whereas a single = is used to assign a value to a variable)
- Southampton is a string so we use ' '

In [None]:
titanic.loc[titanic['embark_town']=='Southampton']

<code>.loc[]</code> can also be used for numbers. E.g. find where the fare paid is £30

In [None]:
titanic.loc[titanic['fare']==30]

For numeric columns we can also use comparison operators other than <code> == </code>


![comparison_opp.jpg](attachment:comparison_opp.jpg) Source: https://data-flair.training/blogs/python-comparison-operators
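Each comparison operator gives back True or False. As a quick sketch on a single invented value (the variable <code>fare</code> here is just a number, not the titanic column):

```python
fare = 30  # hypothetical single value

# each comparison evaluates to True or False
comparisons = {
    'fare == 30': fare == 30,   # equal to
    'fare != 30': fare != 30,   # not equal to
    'fare < 30': fare < 30,     # less than
    'fare <= 30': fare <= 30,   # less than or equal to
    'fare > 30': fare > 30,     # greater than
    'fare >= 30': fare >= 30,   # greater than or equal to
}

for expression, result in comparisons.items():
    print(expression, '->', result)
```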

Write code using <code>.loc</code> to answer the following questions:
- How many passengers paid less than £30 (number of rows = number of passengers)?
- How many passengers were male (number of rows = number of passengers)?
- How many passengers were 75 or older? 
- How many passengers did not embark in Southampton?

In [None]:
# How many passengers paid less than £30 (number of rows = number of passengers)?


In [None]:
# How many passengers were male (number of rows = number of passengers)?


In [None]:
# How many passengers were 75 or older? 


In [None]:
# How many passengers did not embark in Southampton?


We might just be interested in one variable, depending on the value of another variable, e.g. the boarding class of passengers who paid £100 or more. 

In [None]:
titanic.loc[titanic['fare']>=100, 'class']
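The general pattern is <code>.loc[rows, columns]</code>: a condition choosing the rows, then the name of the column to return. A sketch on an invented dataframe called <code>orders</code>:

```python
import pandas as pd

# hypothetical table of drinks orders
orders = pd.DataFrame({
    'item': ['tea', 'latte', 'mocha', 'espresso'],
    'price': [1.8, 3.2, 3.6, 2.1],
    'size': ['small', 'large', 'large', 'small'],
})

# rows where price is 3 or more, returning only the 'size' column
expensive_sizes = orders.loc[orders['price'] >= 3, 'size']
print(expensive_sizes)

# the number of matching rows is the first value of .shape
n_expensive = orders.loc[orders['price'] >= 3].shape[0]
print(n_expensive)
```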

### Rounding data

The youngest passenger in the titanic data is 0.42 years old (approx. 5 months). However, the ages of most passengers over 1 are given to the nearest year. 

We can clean the data so that all passenger ages are to the nearest year. We can do this using the <code>.round(n)</code> function, where n is the number of decimal places

In [None]:
# first identify passengers under 1
titanic.loc[titanic['age']<1]

In [None]:
# round age to whole number (0 decimal places)
titanic['age'] = titanic['age'].round(0)

In [None]:
# re-identify passengers under 1
titanic.loc[titanic['age']<1]

As all the other children under 1 were older than 6 months their age is rounded up to 1, so only 1 child aged 0 remains. 
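One detail to be aware of when rounding: values exactly halfway between two whole numbers are rounded to the nearest even number, not always upwards. The sketch below uses a few invented ages to show this (the name <code>ages</code> is an assumption for illustration).

```python
import pandas as pd

# hypothetical ages, including two exact halves
ages = pd.Series([0.42, 0.67, 28.5, 29.5])

# round to 0 decimal places
rounded = ages.round(0)

# 0.42 goes down and 0.67 goes up as expected;
# the exact halves 28.5 and 29.5 go to the nearest even number
print(rounded.tolist())
```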

In [None]:
# split the data into two dataframes: children, and everyone else
titanic_child = titanic.loc[titanic['who']=='child']
titanic_adult = titanic.loc[titanic['who']!='child']

### Manually correcting data

Sometimes, for various reasons (human error, misclassification, analytical purpose), you might want to overwrite data.

In this case we are going to look at the titanic data and see how children are classified. 

<br> <font color='orchid'> <b>Run the code below to show the oldest age of children aboard the Titanic.</b></font>

In [None]:
titanic.loc[(titanic['who']=='child')]['age'].max()

Looking at this data it suggests that anyone over the age of 15 might be classified as an adult. 

We can check this theory by selecting everyone who is not a child, <code>(titanic['who']!='child')</code>, and using the <code> & </code> symbol to keep only those in this group who are also under 18, <code>(titanic['age'] < 18)</code>. Each condition is wrapped in its own parentheses.

In [None]:
titanic.loc[(titanic['who']!='child') & (titanic['age']<18)]

In [None]:
titanic.loc[(titanic['who']!='child') & (titanic['age']<18)].shape[0]

We can see there are 30 passengers who are classified as 'man' or 'woman' (i.e. not children) but who are under 18 years old. We might instead want to classify those under 18 as children.

We can do this by adapting the code we just used but this time selecting just the 'who' column <code>, 'who']</code>
<br> <font color='orchid'> <b>Run the code below </b></font>

In [None]:
titanic.loc[(titanic['who']!='child') & (titanic['age']<18),"who"]

We can then overwrite these values by assigning 'child' to the selected 'who' column values, using the code <code>= 'child'</code>

In [None]:
titanic.loc[(titanic['who']!='child') & (titanic['age']<18),"who"] = 'child'

To test this has worked re-run the code selecting those who are not children but are under 18 (this should now not return any rows).

In [None]:
titanic.loc[(titanic['who']!='child') & (titanic['age']<18)]
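The whole select-then-overwrite pattern can be seen end to end on a tiny invented table. In this sketch the dataframe <code>guests</code> and the variable <code>mask</code> are made-up names; storing the combined condition in a variable is optional but saves retyping it.

```python
import pandas as pd

# hypothetical table: two under-18s are labelled 'adult'
guests = pd.DataFrame({
    'age': [12, 16, 17, 30],
    'group': ['child', 'adult', 'adult', 'adult'],
})

# each condition sits in its own parentheses, joined with &
mask = (guests['group'] != 'child') & (guests['age'] < 18)
print(mask.sum())  # number of rows matching both conditions

# overwrite the 'group' value for the matching rows only
guests.loc[mask, 'group'] = 'child'
print(guests)
```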

In [None]:
titanic.head()

<div class="alert alert-block alert-info">

## Check your answers
    

Answers to the "enter your own code" sections above. 

<b>2.3.1 Summarising a single column <font color='orchid'> Write code to describe the "age" column of the titanic dataframe<font color='orchid'></b>
 
<img src="screenshots/titanic_describe.png" width=1000 >

    
<b> 1.5.1 Lists and basic maths<font color='orchid'> What happens when we try to multiply the list by 3? <font color='orchid'> </b> The list repeats 3 times:
<img src="screenshots/number_list.png" width=1000 >
    
<b> 1.6.2 Data frame columns <font color='orchid'> Enter titanic.columns in the cell below. <font color='orchid'> </b>

<img src="screenshots/titanic_columns.png" width=1000 >   
    
<b> 1.6.2.1  Referring to a column in a dataframe<font color='orchid'> Select the 'fare' column from the titanic dataframe.<font color='orchid'> </b>
 
<img src="screenshots/titanic_fare.png" width=1000 >    


<div class="alert alert-block alert-warning">

## Extra task

If you have time, load the coffee dataset (it may already be loaded) and answer the following questions using your newfound coding skills. 
    
1. How many columns and rows does the coffee data frame have?
2. What are the data types of the different columns in the coffee dataframe?
3. Is there any data obviously missing from the dataframe?
4. Using <code>.describe()</code>, for which variables are summary statistics shown? Why only these variables?
5. What is the average (mean) price of coffee?
6. What is the standard deviation in coffee price?
7. What is the median coffee rating?
8. Using <code>.nlargest()</code> how many coffees scored 5/5?

In [None]:
# check coffee data frame has been read in using coffee.head()
coffee.head()

Use <code>coffee = pd.read_csv("Intro_to_Python/coffee_data.csv")</code> to re-read in the data if you get an error message

In [None]:
# 1. How many columns and rows does the coffee data frame have?


In [None]:
# 2. What are the data types of the different columns in the coffee dataframe?



In [None]:
# 3. Is there any data obviously missing from the dataframe?


In [None]:
# 4. Using .describe() summary statistics for which variables are shown? 


Why only these variables: (write markdown (text) answer here)

In [None]:
# 5. What is the average (mean) price of coffee?


In [None]:
# 6. What is the standard deviation in coffee price?


In [None]:
# 7. What is the median coffee rating?


In [None]:
# 8. Using .nlargest() how many coffees scored 5/5?


Number of coffees scoring 5/5: (type answer here- markdown)

Write markdown (text) answer here:




### Extra task answers <a class="tocSkip">   
    
#### How many columns and rows does the coffee data frame have? <a class="tocSkip">   
<img src="screenshots/coffee_shape.png" width=1000 >
24 rows, 6 columns
    

#### What are the data types of the different columns in the coffee dataframe?<a class="tocSkip"> 
<img src="screenshots/coffee_dtypes.png" width=1000 > 
Object (string): coffee_name, coffee_type, coffee_shop
    
Float: price
    
Integer: rating
    
Boolean: cake_deal
    
#### Is there any data obviously missing from the dataframe?<a class="tocSkip">
<img src="screenshots/coffee_info.png" width=1000 >     
    
![coffee_info.png](attachment:coffee_info.png)
No: There are 24 rows and no null entries (all entries are non-null), indicating no missing data (NaN).
    
    
*Answers are screenshots of code so cannot be copied and pasted. Type out the code yourself if you get stuck. 

<img src="screenshots/answers_coffee_2.png" width=1000 >     


# FAQs 

## Why do some functions have parentheses and others do not? <a class="tocSkip">

Parentheses indicate the difference between methods and attributes.

### Attributes: <a class="tocSkip">
- do not have parentheses () 
- are values associated with an object e.g. <code> <font color='blue'>dataframe</font>.shape, <font color='blue'>dataframe</font>.columns, <font color='blue'>dataframe</font>.dtypes </code>. These are all attributes of the dataframe (object) they are applied to. 

### Methods: <a class="tocSkip">
- do have parentheses () 
- are functions associated with particular objects. E.g. <code> <font color='blue'>dataframe</font>.head(), <font color='blue'>dataframe</font>["<font color='blue'>column</font>"].mean() </code> 
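The difference can be seen on a small invented dataframe (called <code>table</code> here). The last line is a sketch of what happens if the parentheses are left off a method: Python refers to the method itself rather than running it.

```python
import pandas as pd

# hypothetical example table
table = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

# attribute: a stored value, no parentheses
print(table.shape)

# method: an action that is run, so it needs parentheses
print(table['x'].mean())

# a method can be called; an attribute value such as .shape cannot
print(callable(table.head), callable(table.shape))
```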

<a id='file_paths'></a>
# File Paths

A file path identifies the exact unique location of a file or folder in a file system.
<br>E.g. This notebook was created in the following folders.<br>
* F: (USB)
    * 2021_22 (Folder)
        * Data_Science (Folder)
            * Intro_to_Python (Folder)
                * Intro_to_Python.ipynb (the Jupyter Notebook file)<br>


Resulting in the file path: "F:\2021_22\Data_Science\Intro_to_Python\Intro_to_Python.ipynb"
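Python's built-in <code>pathlib</code> module can split a file path like the one above into its parts. This is a sketch only: it works on the text of the path and does not check that the file exists. Note the <code>r</code> before the speech marks, which tells Python to treat the backslashes as ordinary characters.

```python
from pathlib import PureWindowsPath

# r"..." is a raw string, so backslashes are kept as they are typed
notebook_path = PureWindowsPath(
    r"F:\2021_22\Data_Science\Intro_to_Python\Intro_to_Python.ipynb"
)

print(notebook_path.drive)        # the drive, e.g. the USB stick
print(notebook_path.parent.name)  # the folder containing the file
print(notebook_path.name)         # the file name
print(notebook_path.suffix)       # the file extension
```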

