# Introduction to Python

Before we get into statistics and Machine Learning, it is worth spending some time on pure Python programming. This course covers the main concepts of:

* Our book: [Introduction Pratique à Python](https://julie-2-next-resources.s3.eu-west-3.amazonaws.com/Introduction_Pratique_%C3%A0_Python-Jedha.pdf) 📗

* Our online course: [Introduction to Python for Data Science](https://app.jedha.co/track/introduction-to-python-for-data-science) 📹

Both are part of your student perks! 🤗

## What you'll learn in this course 🧐🧐

- Create variables
- Write conditions & loops
- Import datasets using the Pandas library
- Manipulate datasets using the Pandas library

## Python programming 🐍

Python is a programming language whose development began in 1989, with a first public release in 1991. It was designed to be easy to read and has many applications: in addition to Data Science, you can use it to program video games, web applications and much more.

### What will we need to code? 💻

There are many programming environments you can use to write your Python code. For this course, use whatever suits you best: Jupyter Lab locally, [Google Colab](https://colab.research.google.com/?hl=fr), or [JULIE's workstations](https://app.jedha.co/workspace), where Jupyter and all the tools needed for Data Science are already installed.


### Hello World 👋

Let's try a first program in our text editor.

In [3]:
first_name = input("What's your name?")
print("Hello {}, how are you today?".format(first_name))

Hello Obi-Wan Kenobi, how are you today?


This program asks the user for their name and stores it in what is called a *variable*. In this case, the variable is called `first_name`.

Next, the `print()` function displays a greeting that includes the value stored in `first_name`: `.format()` inserts that value in place of the `{}` placeholder.

This is about the simplest program there is, but it already introduces some useful concepts. Let's now build a complete application that will teach you the fundamental concepts of the Python language along the way.
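
As a small sketch of what a variable is, the snippet below stores the name directly instead of asking for it with `input()`, so it runs without any typing. The `greet` function is a hypothetical helper that is not part of the course material, and it uses an f-string, which is another way to write what `.format()` does.

```python
# Hypothetical helper (not from the course): builds the greeting from a name
# passed as an argument instead of reading it with input().
def greet(first_name):
    # An f-string inserts the variable's value where {first_name} appears
    return f"Hello {first_name}, how are you today?"


first_name = "Obi-Wan Kenobi"  # a variable holding a piece of text (a string)
message = greet(first_name)    # a variable holding the function's result
print(message)
```

Putting the logic in a function makes it easy to reuse the same greeting with different names.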

### Write a condition 🌴

There's nothing like building a quiz to understand programming principles. In this application, the user has to answer three questions and gets 3 "chances" to answer them all.

This is not easy for a first application, so let's do it step by step.

A first concept to master is the way you build a condition. Indeed, we will need to check if the user has given, or not, the right answer to the question.

Here is how it is structured:

In [None]:
question = input("What is the color of Henry IV's white horse?")

if question == "white":
  print("Bravo! That's the right answer")
else:
  print("Too bad... that was the wrong answer")


A condition starts with `if`, followed by the expression you want to check and a colon. Here, `==` tests whether two values are equal. The indented block under the `if` runs only when the condition is true. The `else` block tells the program what to do when the condition is not met.

🚨 **WARNING**: indentation is part of Python's syntax: it defines which lines belong to the `if` and `else` blocks. If it is inconsistent, Python raises an `IndentationError` and your script won't run.
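
Here is a minimal sketch of the same condition wrapped in a function, so you can try it with several answers without typing them. The `check_answer` name, the `elif` branch and the clean-up with `strip()` and `lower()` are assumptions added for illustration; they are not part of the quiz above.

```python
# Hypothetical helper: returns the message the quiz would print
def check_answer(answer, expected="white"):
    # strip() removes surrounding spaces, lower() ignores capital letters
    cleaned = answer.strip().lower()
    if cleaned == expected:
        return "Bravo! That's the right answer"
    elif cleaned == "":
        # elif adds a second condition, checked only if the first one failed
        return "You did not type anything"
    else:
        return "Too bad... that was the wrong answer"


print(check_answer(" White "))
print(check_answer("black"))
print(check_answer(""))
```

Each indented block belongs to the `if`, `elif` or `else` line just above it.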


### Create a loop 🔁

Now that we know how to verify the answer to a question, we should not move on to the next question as long as the answer given is wrong. This can be done with a loop.

There are two types of loops:

#### For

The _for_ loop lets you iterate over a **finite number** of elements. In our case, we could say, for example, that the person has 3 chances before losing the quiz and thus leaving the loop.

In [None]:
for i in range(0, 3):
  print("you have {} chances".format(3 - i))
  question = input("What is the color of Henry IV's white horse?")
  if question == "white":
    print("Bravo, you got it")
    break
  else:
    print("Too bad, you didn't get the right answer")
    if i == 2:
      print("Ah, you lost the game...")


Here, `range(0, 3)` produces the values 0, 1 and 2, so `i` takes a new value at each iteration and tells us how many attempts have been used.

_NB:_ We used the `break` statement, which exits a loop even if the iterations are not finished. This is quite useful, although it is not always considered best practice.

This is not the most elegant way of doing it, however. There is another type of loop that can help us.
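
To see what `range()` and `break` do without typing anything, here is a small sketch where a list of made-up answers replaces `input()`. The `simulated_answers` and `attempts_used` names are assumptions for this illustration only.

```python
# Made-up answers stand in for input() so the example runs on its own
simulated_answers = ["black", "white", "grey"]

attempts_used = 0
for i in range(0, 3):  # i takes the values 0, 1 and 2
    attempts_used = i + 1
    answer = simulated_answers[i]
    print("Attempt {}: {}".format(attempts_used, answer))
    if answer == "white":
        print("Bravo, you got it")
        break  # leaves the loop: the third answer is never read
    else:
        print("Too bad, you didn't get the right answer")

print("Attempts used:", attempts_used)
```

Because the right answer comes second, the loop stops after two iterations instead of three.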

#### While

The _while_ loop allows iterating as long as a condition is true. This has the advantage of not having to specify the number of iterations needed in the loop. So let's rewrite the same code with a _while_ loop.

In [None]:
question = input("What is the color of Henry IV's white horse?")

# Beginning of while loop
while question != "white":
  print("Too bad, that's not the right answer")
  question = input("What is the color of Henry IV's white horse?")
# End of while loop

print("Bravo! You've found the answer")
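
The loop above never stops if the user keeps answering wrong. As a sketch of how the 3 "chances" rule could be combined with a _while_ loop, the snippet below adds a counter to the condition. The list of made-up answers and the names `max_chances`, `chances_used` and `found` are assumptions for this illustration.

```python
# Made-up answers stand in for input() so the example runs on its own
simulated_answers = ["black", "grey", "white"]
max_chances = 3

chances_used = 0
found = False
# The loop continues while the answer is not found AND chances remain
while not found and chances_used < max_chances:
    answer = simulated_answers[chances_used]
    chances_used += 1
    if answer == "white":
        found = True
    else:
        print("Too bad, that's not the right answer")

if found:
    print("Bravo! You've found the answer")
else:
    print("Ah, you lost the game...")
```

Here no `break` is needed: the condition of the loop itself decides when to stop.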

### Data types in Python 🦧

As in all programming languages, Python has several "types" of data. We have begun to see some of these in the various examples above. But there are many more.

Although you don't need to know each type of data by heart, it's good to understand what they are used for, because then you can understand the different logical operations you can perform with them.



#### Tuples

A tuple (or N-tuple) is a heterogeneous and immutable collection of data. Here is an example:

In [5]:
a_tuple = (10, 20, "this is an N-tuple", 3.14)

You can access each item in your tuple in the following way:

In [6]:
print(a_tuple[0])

10



This gives us the result `10`.

NB: Be careful, in Python indexing starts at 0 and not at 1, which is why the first item of our tuple is at index 0.
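
A few more ways to read a tuple, as a short sketch. The variable names used for unpacking (`first`, `second`, `text`, `pi_approx`) are chosen for this illustration only.

```python
a_tuple = (10, 20, "this is a tuple", 3.14)

print(len(a_tuple))   # number of items in the tuple
print(a_tuple[0])     # first item (index 0)
print(a_tuple[-1])    # last item: negative indexes count from the end
print(a_tuple[1:3])   # items at index 1 and 2 (index 3 is excluded)

# Unpacking assigns each item of the tuple to its own variable
first, second, text, pi_approx = a_tuple
print(first, pi_approx)
```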


#### Lists

A list looks very much like a tuple, with the difference that a list is mutable (it can be changed after it is created). Here is an example:

In [8]:
a_list = [1, 3, 10, "This is a list", 2.1095]


As with a tuple, you can access each item in your list by putting its index in square brackets, for example `a_list[0]`. Printing the variable itself displays the whole list:

In [9]:
print(a_list)

[1, 3, 10, 'This is a list', 2.1095]



🚨 **WARNING**: As written above, the difference between a tuple and a list is that the first is not editable while the second is. Here is an example:

In [10]:
a_tuple[0]=230

TypeError: 'tuple' object does not support item assignment

Whereas for a list:

In [11]:
a_list[0] = 230
print(a_list)

[230, 3, 10, 'This is a list', 2.1095]


While the first item in our list initially had the value 1, we were able to change it to 230, which we cannot do with a tuple.
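
The sketch below puts the two behaviours side by side. It also uses `append()`, a list method not shown above, and a `try` / `except` block so that the error raised by the tuple does not stop the script.

```python
a_list = [1, 3, 10, "This is a list", 2.1095]

print(a_list[3])       # read one item by its index
a_list[0] = 230        # replace an existing item
a_list.append("new")   # add an item at the end of the list
print(a_list)

a_tuple = (1, 3, 10)
try:
    a_tuple[0] = 230   # not allowed: tuples are immutable
except TypeError as error:
    print("Cannot modify a tuple:", error)

print(a_tuple)         # the tuple is unchanged
```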

#### Dictionaries 📖

Finally, a dictionary associates a value with a defined key. Here is an example:

In [20]:
a_dict = {"first_name": "Michel", "last_name": "Delpeche" }

In [21]:
# Print keys
print("#### Print keys")
print(a_dict.keys())

# Print values
print("#### Print values")
print(a_dict.values())

#### Print keys
dict_keys(['first_name', 'last_name'])
#### Print values
dict_values(['Michel', 'Delpeche'])


You can also access a value within a dictionary by specifying its key.

In [22]:
# Access specific value
print(a_dict["first_name"])

Michel


The advantage is that you can list the keys, list the values, or look up the value corresponding to a given key.
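
Like lists, dictionaries are mutable. Here is a short sketch of a few common operations. The `"country"` key and its value are made up for this illustration.

```python
a_dict = {"first_name": "Michel", "last_name": "Delpeche"}

# Add a new key (illustrative value, not from the course)
a_dict["country"] = "France"

# get() returns a default value instead of raising an error
# when the key does not exist
print(a_dict.get("email", "unknown"))

# items() gives each key together with its value
for key, value in a_dict.items():
    print("{} -> {}".format(key, value))

# "in" checks whether a key exists
print("first_name" in a_dict)
```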

## Manipulating data with Pandas 🐼

In Data Science, one of the must-know libraries is **Pandas**. This library allows you to manipulate tabular datasets very easily.

Before we start, let's not forget that for all the operations we show in this course, we have imported the Pandas library as follows:

In [23]:
# You need to write the code below
# to import pandas
import pandas as pd

### A new data type: the DataFrame

A DataFrame is the typical Pandas object you will manipulate. It has two dimensions: rows and columns. You can think of a DataFrame as an Excel sheet.

For example:

In [24]:
# Create a DataFrame from scratch
# Keys of the dictionary correspond to column names
# Values of the dictionary correspond to the values stored in each column
pd.DataFrame({"A":[1,2,3,5], "B":[1,4,2,3]})

|   | A | B |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 2 | 4 |
| 2 | 3 | 2 |
| 3 | 5 | 3 |
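
Once the DataFrame is stored in a variable, you can inspect it. This is a minimal sketch, where `df` is simply a name chosen for the example.

```python
import pandas as pd

# Same DataFrame as above, this time stored in a variable
df = pd.DataFrame({"A": [1, 2, 3, 5], "B": [1, 4, 2, 3]})

print(df.shape)           # (number of rows, number of columns)
print(list(df.columns))   # column names
print(df["A"].tolist())   # one column converted back to a Python list
```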



### Import data

Most of the time as a Data Scientist, you're not going to create datasets out of thin air 🌬️. You will already have the data in files. In this section, we explain how to load those files.

Let's say we have a file named `data.csv`.

We can then import the data in the following way:

In [None]:
# Import data from a csv file
df = pd.read_csv("data.csv")
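
If you don't have a `data.csv` file at hand, you can still try `pd.read_csv()`. The sketch below assumes a tiny made-up CSV content and wraps it in `io.StringIO`, which makes a piece of text behave like a file.

```python
import io

import pandas as pd

# Made-up CSV content: the first line holds the column names
csv_text = "name,score\nAda,12\nLinus,15\n"

# io.StringIO lets read_csv read the text as if it were a file
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)  # pandas guesses the type of each column
```

With a real file, you would pass its path instead, as in the cell above.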

### Preview a database 💐

For the rest of this part, we will have a dataset stored in a variable _iris_, shown in the following way:

In [4]:
# Importing data from Iris Data
iris = pd.read_csv("iris.csv")

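|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |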


If you want a description of the dataset, feel free to read below: 

In [20]:
# Print description of iris dataset
from sklearn.datasets import load_iris  # Don't pay attention to this code for the moment
print(load_iris()["DESCR"])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :


#### `.head()`

To get a quick overview of the "top" of our dataset, we can use `iris.head()` in the following way:

In [5]:
# Preview first 5 rows of your dataset
iris.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


#### `.columns`

You can also see the names of your columns via `iris.columns`.

In [6]:
iris.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')
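
The preview tools can be tried on any DataFrame. This sketch uses a small made-up one (`df`, with columns `x` and `y`) and also shows `.tail()` and `.shape`, which are not covered above.

```python
import pandas as pd

# Made-up DataFrame with 10 rows
df = pd.DataFrame({"x": list(range(10)), "y": [v * v for v in range(10)]})

print(df.head())             # first 5 rows by default
print(df.head(3))            # first 3 rows
print(df.tail(2))            # last 2 rows
print(df.shape)              # (rows, columns)
print(df.columns.tolist())   # column names as a Python list
```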


### Selecting values from a database

Now that we have a better idea of what our dataset contains, let's see how to select only a portion of its values.

First, a very simple way to select a column is to do: `name_of_dataframe["column_name"]`. For example:

In [7]:
iris["sepal length (cm)"]

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal length (cm), Length: 150, dtype: float64

#### `.loc` VS `.iloc`

Here are the two indexers you will use most often, because they are the most flexible and precise: `.loc` and `.iloc`.

`.loc` selects rows and columns by their **labels**.

In [9]:
iris.loc[1:10, ["sepal length (cm)", "petal width (cm)"]]

|   | sepal length (cm) | petal width (cm) |
|---|---|---|
| 1 | 4.9 | 0.2 |
| 2 | 4.7 | 0.2 |
| 3 | 4.6 | 0.2 |
| 4 | 5.0 | 0.2 |
| 5 | 5.4 | 0.4 |
| 6 | 4.6 | 0.3 |
| 7 | 5.0 | 0.2 |
| 8 | 4.4 | 0.2 |
| 9 | 4.9 | 0.1 |
| 10 | 5.4 | 0.2 |


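In this example, we have selected the rows labeled 1 to 10 and the columns `sepal length (cm)` and `petal width (cm)`. Note that with `.loc` the end label (10) is included, and that the row labeled 0 is not part of the selection.

As you can see, the general structure of `.loc` is: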

```python
dataset.loc[start_line: end_line, [name_of_column_1, name_of_column_2, ..., name_of_column_n]]
```

However, it is sometimes simpler to select columns by their position rather than by their name. For that, we use `.iloc`:

In [12]:
iris.iloc[1:10, [0, 3]]

|   | sepal length (cm) | petal width (cm) |
|---|---|---|
| 1 | 4.9 | 0.2 |
| 2 | 4.7 | 0.2 |
| 3 | 4.6 | 0.2 |
| 4 | 5.0 | 0.2 |
| 5 | 5.4 | 0.4 |
| 6 | 4.6 | 0.3 |
| 7 | 5.0 | 0.2 |
| 8 | 4.4 | 0.2 |
| 9 | 4.9 | 0.1 |


Here, for example, we have selected the columns `sepal length (cm)` and `petal width (cm)` by their **position**, i.e. `0` and `3`. Unlike `.loc`, `.iloc` excludes the end of the slice, so `1:10` returns the rows at positions 1 to 9.

This notation is therefore more convenient than the one above when column names are long or hard to type.
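
The difference between the two is easiest to see on a small example. This sketch uses a made-up DataFrame with columns `length` and `width`, and the names `by_label` and `by_position` are chosen for the illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "length": [5.1, 4.9, 4.7, 4.6, 5.0],
        "width": [3.5, 3.0, 3.2, 3.1, 3.6],
    }
)

# .loc works with labels: the end label (3) is included
by_label = df.loc[1:3, ["length"]]

# .iloc works with positions: the end position (3) is excluded
by_position = df.iloc[1:3, [0]]

print(by_label)
print(by_position)
```

The same slice `1:3` returns three rows with `.loc` and two rows with `.iloc`.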

### Data Manipulation with Pandas

To complete this course, it is useful to know some Pandas functions to allow you to do some calculations.

#### Mean, standard deviation

To calculate a mean or standard deviation, you can use the methods `.mean()` and `.std()`:

In [13]:
# Calculate mean of each column of a dataset
iris.mean()

sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
dtype: float64

In [14]:
# Calculate standard deviation of each column of a dataset
iris.std()

sepal length (cm)    0.828066
sepal width (cm)     0.435866
petal length (cm)    1.765298
petal width (cm)     0.762238
dtype: float64

#### Describe

If you want to know the main statistics in one line of code, you can use `.describe()`.

In [15]:
# Get main statistics of a given dataset
iris.describe()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
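
These methods also work on a single column, and the result of `.describe()` is itself a DataFrame that you can query with `.loc`. Here is a minimal sketch on a small made-up DataFrame. Note that `.std()` in Pandas computes the sample standard deviation by default (`ddof=1`); passing `ddof=0` gives the population version.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 5], "B": [1, 4, 2, 3]})

print(df["A"].mean())        # 2.75
print(df["A"].std())         # sample standard deviation (ddof=1)
print(df["A"].std(ddof=0))   # population standard deviation

# describe() returns a DataFrame: rows are statistics, columns are variables
summary = df.describe()
print(summary.loc["mean", "B"])
print(summary.loc["max", "A"])
```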


## Resources 📚📚

- Introduction pratique à Python - [https://bit.ly/20DK](https://docs.google.com/document/d/1YyVDVGR_k89m7iLHKFO79xvsDAPswfuFHdh5oY8XBCU/edit#heading=h.vxfub9k5hsn7)
- Introduction to Python Workshop - [https://bit.ly/E3Dcge](https://github.com/JedhaBootcamp/introduction-pratique-a-python)
- How to learn Pandas - [https://bit.ly/2CDDc4Z](https://bit.ly/2CDDc4Z)