# Python Foundations for Data Science 🤓

# Agenda for our workshop ⏰

### Basics first 🐣 Python Data Types (40 min)
*  Strings
*  Practice Round ✍️
*  Integers and Floats
*  Practice Round ✍️

### Data Analytics - Reading Data in Python 🧮(30 min)
*  Tools we will use
*  Importing data from CSV
*  Previewing the DataFrame
*  Practice Round ✍️

### Data Analytics - Our first analysis 🔍(50 min)
*  Boolean analysis
*  Our first analysis - let's check out the largest countries
*  Combining data points for more insight
*  Practice Round ✍️
*  Data visualization

### Welcome to the Jupyter Notebook! How to use this file? 🤔

* Type inside the empty cells to write code. Empty code cells have an `In [ ]:` prefix in front of them
* Press the `return/enter ⏎` key to add a new line inside the cell
* To display your results, use Python's built-in `print(STUFF_YOU_WANT_TO_PRINT)` function, or simply put what you want to see on the last line of the cell. The result of the last line will appear as `Out[]:`, the output of the cell :)
* Press `shift` + `return/enter ⏎` to run your code 🤓 This runs the code inside your currently selected cell and shows anything passed to `print()` plus the result of the last line of your cell
* To add a new cell, select any cell and press the `b` key (make sure you are not just typing the letter `b` in the cell). This will add a new cell below
* To delete a cell, double press the `d` key (make sure you are not just typing the letter `d` in the cell)

#### Try to run the code below! 😃

In [None]:
print("Welcome to the Python workshop of Le Wagon!")
print("Are you ready to do your first data analysis?")
"Yes we are! 🚀"

## Python Basics ⚙️

Everything in Python is an object. Objects have attributes and methods.
Real-life example: a car, as an object, has attributes such as color, make, year of manufacture and transmission mode, and methods such as accelerating, reversing and braking.

### Python Data Types and Variables 🧱
- Strings
- Integers
- Floats
- Booleans & Logicals
- Variables
- Lists & Dictionaries*

*Only FYI

### STRINGS   str() 📝
Strings represent text and are defined with "double" or 'single' quotes

    "Python for Beginners"
    'Python for Beginners'
    '2019'

#### Built-in Functions and Methods for Strings
    type("python")
    int('2019')
    len("python")
    'python'.capitalize()
    "python is fun".count('n')
    "PYTHON".lower()
    'python'.index('o')
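
To make the list above concrete, here is a small sketch that prints what each expression evaluates to. Nothing new is introduced; the comments show the values you should see when you run it.

```python
# Each expression from the list above, with the value it evaluates to
print(type("python"))               # <class 'str'>
print(int('2019'))                  # 2019 (now an integer, not text)
print(len("python"))                # 6
print('python'.capitalize())        # Python
print("python is fun".count('n'))   # 2
print("PYTHON".lower())             # python
print('python'.index('o'))          # 4 (positions start at 0)
```

Note that `type`, `int` and `len` are functions that you wrap around a value, while `capitalize`, `count`, `lower` and `index` are methods that you call on the string with a dot.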

### Your turn! 🚀Try out `strings` below!

How? Type in a line of code on the empty line below and press `Shift` + `enter/return ⏎` keys to run it

### INTEGERS   int() 🔢
Integers represent whole numbers and support standard arithmetic.
    
    10
    -24
    2019

#### Built-in Functions for Integers
    type(10)
    abs(-24)
    min(10, -24, 2019)
    max(10, -24, 2019)
    float(10)

#### Arithmetic Operators
    1 + 2    => 3  (Addition)
    1 - 2    => -1 (Subtraction)
    2 * 4    => 8  (Multiplication)
    5 / 2    => 2.5 (Division)
    5 // 2   => 2  (Floor division)
    5 % 2    => 1  (Modulus)
    2 ** 3   => 8  (Exponent)
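
The two division operators are easy to mix up, so here is a minimal sketch of how they differ. The variable names are only for illustration.

```python
# Regular division always gives a float, even when the result is a whole number
regular = 4 / 2
print(regular, type(regular))   # 2.0 <class 'float'>

# Floor division rounds down to the nearest whole number
floored = 5 // 2
print(floored, type(floored))   # 2 <class 'int'>

# "Rounds down" also applies to negative numbers
print(-5 // 2)                  # -3

# Modulus gives what is left over after floor division
print(5 % 2)                    # 1
```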

### FLOATS   float()
Floats represent decimal numbers and support standard arithmetic.

    10.23132
    -24.1
    2019.0

#### Built-in Functions for Floats
    # all the built-in functions for integers also work on floats
    round(10.23132, 2)
    import math # first import math package - a collection of advanced mathematical functions
    math.floor(10.23132)   
    math.ceil(-24.1)
    math.ceil(24.1)
    math.factorial(5)
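
Here is the same list as a runnable sketch, with the value of each call in a comment. The variable names are just labels for this example.

```python
import math  # the math package must be imported before we can use it

rounded = round(10.23132, 2)       # keep 2 digits after the decimal point
rounded_down = math.floor(10.23132)
rounded_up_negative = math.ceil(-24.1)
rounded_up_positive = math.ceil(24.1)
five_factorial = math.factorial(5)  # 5 * 4 * 3 * 2 * 1, needs a whole number

print(rounded)               # 10.23
print(rounded_down)          # 10
print(rounded_up_negative)   # -24
print(rounded_up_positive)   # 25
print(five_factorial)        # 120
```

`math.ceil(-24.1)` gives `-24` because "up" means towards the larger number, and -24 is larger than -24.1.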

###  Your turn! 🚀Try out `integers` and `floats` below!

How? Type in a line of code on the empty line below and press `Shift` + `enter/return ⏎` keys to run it

### BOOLEANS & LOGICALS   bool() 🔀
Booleans represent something that is True or False

    True
    False
    
Comparison Operators always yield Boolean values

     ==       # different from assignment operator '='
     !=
     > or >=
     < or <=
     
Logicals include:

    or
    and
    not
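
A short sketch of comparisons and logicals working together. The variable `age` and the thresholds 18 and 65 are made up for this example.

```python
age = 29  # hypothetical value, just for this example

print(age == 29)               # True  (is age equal to 29?)
print(age != 29)               # False (is age different from 29?)
print(age >= 18 and age < 65)  # True  (both sides must be True)
print(age < 18 or age > 65)    # False (at least one side must be True)
print(not age == 29)           # False (not flips True to False)
```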

### VARIABLES 📦
- Allows you to store values to reuse later
- You assign a value to a variable
- Variables can be overwritten and incremented
- By convention, variable names should be in snake_case

Variable assignment statement - *putting things in a box*

    putting the `string` "John" in the box `first_name`
        
        first_name = "John"
        
    putting the `integer` 29 in the box `age`
    
        age = 29
       
    
Variable reading - *opening the box to see what is inside*

    print(age)
    print(first_name)
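
The bullets above mention that variables can be overwritten and incremented. A minimal sketch of what that looks like, reusing the `age` and `first_name` boxes (the `greeting` variable is added only for illustration):

```python
age = 29
age = 30        # overwriting: the box now holds 30, the 29 is gone
age += 1        # incrementing: same as age = age + 1
print(age)      # 31

first_name = "John"
greeting = "Hello, " + first_name   # variables can be used to build new values
print(greeting)                     # Hello, John
```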

###  Your turn! 🚀Try out `booleans` and variables

How? Type in a line of code on the empty line below and press `Shift` + `enter/return ⏎` keys to run it.

Some ideas:
* make two variables and check if they are equal (*pssst*, ==)
* make two variables - your first name and last name - and try to add them together
* create a variable with your age in years - now calculate your age in minutes (approximately)

# Data Science - Analysis 🧮

### First of all - what are we using? 🔨

[Jupyter Notebook](https://jupyter.org) is an open-source web application which allows you to create and share documents with code, visualizations and narrative. Which is what we are doing right now! 

[NumPy](https://www.numpy.org) is the fundamental package for scientific computing in Python.

[Pandas](https://pandas.pydata.org) is an open-source library providing easy-to-use data structures and data analysis tools for Python. 

### So this is how every Jupyter notebook starts...

In [None]:
import numpy as np
import pandas as pd
print("You are good to go!")

### Python and Pandas are great for reading data files, like CSV

In [None]:
file = "countries of the world.csv"
countries_data = pd.read_csv(file, decimal=",")
# no need to use print() when we want to see DataFrames, simply put what you want to see on the last line
countries_data
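
Wondering what `decimal=","` does? Here is a rough sketch using a tiny made-up CSV (the countries and numbers are invented, and `io.StringIO` is only used so the example does not need a real file). It assumes the decimal numbers are written with a comma, like `12,5`.

```python
import io
import pandas as pd

# A tiny made-up CSV where decimals use a comma instead of a dot
csv_text = 'Country,Population,Density\n"Aland ",100,"12,5"\n"Bland ",200,"0,75"\n'

# With decimal=",", pandas turns "12,5" into the number 12.5
with_decimal = pd.read_csv(io.StringIO(csv_text), decimal=",")
print(with_decimal["Density"].tolist())     # [12.5, 0.75]

# Without it, the values stay as text and we cannot do math with them
without_decimal = pd.read_csv(io.StringIO(csv_text))
print(without_decimal["Density"].tolist())  # ['12,5', '0,75']
```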

### We can then have a quick look at our `DataFrame` 🗺

We can run `countries_data.shape` to check how many rows and columns we have. The result will be printed as (rows, columns)


In [None]:
print("rows and columns -->", countries_data.shape)

We can run `countries_data.columns` to check all the columns that our DataFrame has


In [None]:
print("all column names -->", countries_data.columns)

We can run `countries_data.head(YOUR_NUMBER)` to check the first `YOUR_NUMBER` records of the DataFrame


In [None]:
countries_data.head(5) # will show the first 5 records

We can run `countries_data.tail(YOUR_NUMBER)` to check the last `YOUR_NUMBER` records of the DataFrame


In [None]:
countries_data.tail(5) # will show the last 5 records

We can run `countries_data[COLUMN_NAME]` to see only the values of `COLUMN_NAME`


In [None]:
countries_data["Population"]

We can run `countries_data[[COLUMN_NAME_1, COLUMN_NAME_2,...etc]]` to see a few values together using the double square bracket notation


In [None]:
countries_data[["Country", "Population"]]

###  Your turn! 🚀Run the code above to import `numpy` and `pandas` and read the CSV to get `countries_data`. No need to change anything there ;)

#### Now play around with your `DataFrame` 🤓

Some ideas:
* check the number of rows and columns in your DataFrame
* check all the columns that you have
* view only the "Climate" column of your DataFrame
* view the first 10 rows of your DataFrame
* view the last 10 rows of your DataFrame, but only viewing the "Country", "Population" and "Climate" 🤔

How? Type in your code on the empty line below and press `Shift` + `enter/return ⏎` keys to run it

## Now let's do some data analytics! 🔍

#### Let's say we want to see countries with over 50,000,000 people. We want to create a `boolean` - `True` or `False` - with this condition

In [None]:
large_population = countries_data["Population"] > 50_000_000
large_population

#### We can see that `large_population` is now a series of `True` and `False` values, one for each row of our countries DataFrame 🧐

#### Now we can apply this condition to our `countries_data` using the square brackets `[condition]`, same syntax as we used for checking columns 

In [None]:
large_countries = countries_data[large_population]
large_countries
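
If the filtering step feels like magic, here is the same idea on a tiny made-up DataFrame (the names `toy`, `is_large` and `large`, and all the values, are invented for this sketch). Rows where the condition is `True` are kept, rows where it is `False` are dropped.

```python
import pandas as pd

# A tiny made-up DataFrame with three pretend countries
toy = pd.DataFrame({
    "Country": ["Aland", "Bland", "Cland"],
    "Population": [80_000_000, 2_000_000, 60_000_000],
})

# Step 1: the condition gives one True/False value per row
is_large = toy["Population"] > 50_000_000
print(is_large.tolist())           # [True, False, True]

# Step 2: putting the condition in square brackets keeps only the True rows
large = toy[is_large]
print(large["Country"].tolist())   # ['Aland', 'Cland']
```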

#### Boom 💥we have our first data analytics solution!

We can also sort our large_countries by Population, having largest first, using the syntax of `large_countries.sort_values("COLUMN_TO_SORT_BY", ascending=False)`, like below


In [None]:
large_countries.sort_values("Population", ascending=False) # change False to True if you want smallest countries first :)

#### We can also combine data points! Let's see which of our large countries have the biggest Service industry 🤔


In [None]:
large_service_countries = large_countries.sort_values("Service", ascending=False)
large_service_countries

Let's make our view a little cleaner - I only want to see the Country, Population and the ratio of Service industry. We can use the `large_service_countries[[COLUMN_NAME_1, COLUMN_NAME_2,...etc]]` syntax for that 🔍


In [None]:
large_service_countries[["Country", "Population", "Service"]].head()

#### Bonus: What if you want to show data for just one country?

In [None]:
countries_data['Country'] = countries_data['Country'].map(str.strip) # Clean the text in Country column
countries_data.set_index('Country', inplace=True) # Set Country names as the unique index
countries_data

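Now we can see that, instead of numbers, we have country names as our index

⚠️**Note** that when you do this you remove the `Country` column, and make it the index of the whole DataFrame instead (for example, `countries_data['Country']` will not work any more)

We can simply use the `countries_data.loc['COUNTRY_NAME']` syntax to get data for one country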

In [None]:
countries_data.loc['Algeria']
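
Here is a small sketch of the same steps on a made-up DataFrame, so you can see what happens to the `Country` column. Two things differ from the cell above and are choices made for this example only: it uses `.str.strip()` to clean the text, and it stores the result in a new variable (`by_name`) instead of using `inplace=True`, so the original `toy` DataFrame keeps its `Country` column.

```python
import pandas as pd

# Made-up data; note the extra space after each country name
toy = pd.DataFrame({"Country": ["Aland ", "Bland "], "Population": [100, 200]})

# Clean the text, then use the country names as the index
toy["Country"] = toy["Country"].str.strip()
by_name = toy.set_index("Country")   # returns a new DataFrame, toy is unchanged

print(by_name.loc["Bland", "Population"])   # 200
print("Country" in by_name.columns)         # False, it is the index now
print("Country" in toy.columns)             # True, the original still has it

# reset_index() turns the index back into a regular column
print(by_name.reset_index().columns.tolist())   # ['Country', 'Population']
```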

###  Your turn! 🚀 Explore your `DataFrame` and run some analysis 🤓

Some ideas:
* check which countries have the warmest climate ☀️
* let's check the tiny countries! make a variable to store countries with Population < 3_000_000
* Beach life! 😎check the countries with longest Coastline. Remember to type the full column name!
* Are these coastal countries always online? 📱Let's sort them by amount of Phone usage. Again, check the full column name!
#### ...and anything else you want to analyze! 🚀

How? Type in your code on the empty line below and press `Shift` + `enter/return ⏎` keys to run it

## Finally, let's make our findings visual 🎨

Once again, we already have Python libraries available for us with all we need.

We will be using [matplotlib](https://matplotlib.org/) - a very popular data visualization library for Python.

In [None]:
%matplotlib inline
import matplotlib # same as with pandas and numpy, we need to first import the library

Let's start by creating our first [plot](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.plot.html) out of our `large_service_countries` data

In [None]:
large_service_countries.plot()

😰 Seems quite messy... 

Let's use the documentation of [DataFrame.plot](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.plot.html) and [matplotlib](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html?highlight=plot#matplotlib.pyplot.plot) to see how we can clean up our plot.

#### First of all, let's reduce the amount of data points we are trying to display

In [None]:
large_service_countries.head().plot(y='Service')

#### What is this strange zig-zag? Our horizontal axis (or the x-axis)! 💡

In [None]:
large_service_countries.head().plot(y='Service', x='Country')

#### This line chart is not very fitting... let's see what other [kinds](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.plot.html) are available to us

In [None]:
large_service_countries.head().plot(kind='bar', y='Service', x='Country')

#### Already much better! Now let's make it more interesting 🤓

In [None]:
large_service_countries.plot(
    kind='bar',
    y=['Service', 'Industry', 'Agriculture'],
    x='Country',
    stacked=True,
    color=['#809BCE', '#B8E0D2', '#EAC4D5'],
    figsize=(8, 6)
).legend(loc='lower left')

###  Your turn! 🚀 Visualize the `DataFrame` you've made in the previous challenges or make new ones 📊

Remember to use online resources to see which attributes you can play with:

* Matplotlib documentation on [plot attributes](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html?highlight=plot#matplotlib.pyplot.plot)
* Pandas documentation on the [DataFrame.plot() function we are using](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.plot.html)
* 👉 [This neat article](http://queirozf.com/entries/pandas-dataframe-plot-examples-with-matplotlib-pyplot) with a bit more visual support