![IE](img/ie.png)

# Statistical Programming with Python

## Master in Business Analytics and Big Data, April 2020-2021

### Professor: Juan Luis Cano Rodríguez <jcano@faculty.ie.edu>

# Outline

- Who am I?
- What is Python? Why Python?
- PyData Ecosystem
- About this course
    - Calendar, sessions
    - Material
    - Software
    - Learning objectives
    - Evaluation method

# Who am I?

![Me](img/juanlu_esa.jpg)

* **Aerospace Engineer** from TU Madrid + 1 year at Politecnico di Milano
* **Planning & Execution Engineer** at **Satellogic**, a satellite imagery company
* **Contributor** to the PyData ecosystem: NumPy, SciPy, conda, Dask, ...
* **Instructor** of Python courses for Data Scientists at Airbus, Boeing, Telefónica and others
* Experience as **Data Scientist** with Python for telco and aerospace industries
* Former chair of the **Python España** non-profit and former co-organizer of **PyCon Spain**
* Free Software and Open Culture advocate and Python enthusiast

# What is Python?

![Major languages](img/growth_new.png)

* **Python is** a dynamic, interpreted programming language
* It is easy to learn, but powerful
* Features a huge ecosystem of contributed packages
* Fastest-growing major language on Stack Overflow

<small>Source: https://stackoverflow.blog/2017/09/06/incredible-growth-python/ (data updated as of 2020)</small>

## Why is Python growing so quickly?

![Related tags](img/related_tags_over_time.png)

* Quick answer: **Data Science**
* Boundless growth since the creation of pandas
* Already popular for scripting and web development

<small>Source: https://stackoverflow.blog/2017/09/14/python-growing-quickly/</small>

More interesting surveys:

* Kaggle, 2019: 83 % of data scientists use Jupyter, 80 % use scikit-learn https://www.kaggle.com/kaggle-survey-2019
  - Comparatively, 38 % use RStudio
* KDnuggets, 2019: Python is the most used programming language in Data Science for the third year in a row https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html
  - 66 % of data scientists use Python
  - Comparatively, 47 % use R and 33 % use SQL or Excel
  - 34 % use specifically Anaconda

## Disadvantages

* Python itself is _slow_ - the trick is that it wraps faster compiled languages (C, C++, Fortran)
* Weaker time series and statistical analysis (use R for that!)
* Packaging, installation and distribution can be tricky (if interested, choose my elective "Advanced Python")
* ~~Python 2 vs 3 split~~ Python 2 has been officially dead since January 2020 🎉 focus on Python 3.7 onwards
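
A minimal sketch of the "slow Python, fast wrapped code" point, assuming NumPy is installed. The function names and the array size are made up for illustration, and the measured times will vary from machine to machine:

```python
import timeit

import numpy as np

# Illustrative input size, chosen arbitrarily
N = 100_000
values = list(range(N))
array = np.arange(N, dtype=np.int64)


def python_sum_of_squares(numbers):
    """Sum of squares using a plain Python loop."""
    total = 0
    for n in numbers:
        total += n * n
    return total


def numpy_sum_of_squares(arr):
    """Same computation, delegated to NumPy's compiled routines."""
    return int(np.dot(arr, arr))


# Time both versions; only the relative difference is meaningful
slow = timeit.timeit(lambda: python_sum_of_squares(values), number=5)
fast = timeit.timeit(lambda: numpy_sum_of_squares(array), number=5)
print(f"pure Python: {slow:.4f} s, NumPy: {fast:.4f} s")
```

Both functions return the same number; the NumPy version usually finishes much sooner because the loop runs in compiled code rather than in the interpreter.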

# PyData ecosystem

![PyData ecosystem](img/ecosystem/1.png)

![PyData ecosystem](img/ecosystem/2.png)

![PyData ecosystem](img/ecosystem/3.png)

![PyData ecosystem](img/ecosystem/4.png)

![PyData ecosystem](img/ecosystem/5.png)

# About this course

## Learning objectives

1. Understand the basics of the Python programming language syntax
2. Learn how to use Python to solve algorithmic problems
3. Understand the different pieces of the PyData ecosystem and the relationship between them
4. Learn how to use the Jupyter environment to conduct exploratory analysis and create reports
5. Learn how to use pandas to manipulate data
6. Learn how to use scikit-learn to solve classical machine learning problems
7. Learn how to use seaborn and Plotly Express to perform static and interactive visualization
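
As a small taste of objectives 5 and 6, here is a hedged sketch assuming pandas and scikit-learn are installed. The `sales` table and its column names are invented for illustration and are not part of the course material:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data, for illustration only: revenue = 2 * ads + 1
sales = pd.DataFrame(
    {
        "store": ["A", "A", "B", "B"],
        "ads": [1.0, 2.0, 3.0, 4.0],
        "revenue": [3.0, 5.0, 7.0, 9.0],
    }
)

# pandas: average revenue per store
mean_revenue = sales.groupby("store")["revenue"].mean()
print(mean_revenue)

# scikit-learn: fit a straight line of revenue against ads
model = LinearRegression().fit(sales[["ads"]], sales["revenue"])
print(model.coef_[0], model.intercept_)
```

The pattern is typical of the ecosystem: pandas holds and reshapes the data, and scikit-learn consumes columns of that data through a uniform `fit` interface.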

## Calendar

* **From September 24th to October 21st**
  - August: 2 F2F sessions, 1 Forum - Python basics
  - September: 1 F2F session, 3 Videoconferences, 5 Forums - pandas, scikit-learn
  - October: 3 F2F sessions, 2 Videoconferences, 3 Forums - visualization, final presentations
* **All F2F and Videoconference sessions will be hands-on**: have your Jupyter notebooks ready!

## Final project

* Analyze a moderately big dataset, write a report, and present the results
* You will have to apply everything we cover in the course
* Details coming soon

## Evaluation method

> The evaluation will be based on the daily progress of the student, as well as a big group project that will be presented during the last session.

| Criteria                   | Score % |
|----------------------------|---------|
| Class Participation        |  10 %   |
| Individual Assignments     |  45 %   |
| Group Project Report       |  35 %   |
| Group Project Presentation |  10 %   |

## Material

* Python basics: "A Whirlwind Tour of Python" by Jake Vanderplas https://github.com/jakevdp/WhirlwindTourOfPython
* Main content: "Python Data Science Handbook" by Jake Vanderplas https://github.com/jakevdp/PythonDataScienceHandbook
* Plotly Express: official tutorials https://plotly.com/python/plotly-express/

...yes, everything is freely available on the Internet :)

## Software

> We will use the **Anaconda distribution 2020.02** or newer with **Python 3.7**

* Compatible with Windows, Mac and Linux
* Contains all the packages that we will need
* Easy to install additional packages
* Backed by a for-profit company, Anaconda http://anaconda.com/ (formerly Continuum Analytics)

Cloud environments:

- Binder https://mybinder.org/
- Google Colab https://colab.research.google.com/
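
To check that a local or cloud environment meets the Python 3.7 requirement, something like the following sketch can be run in a notebook cell. It uses only the standard library; the helper name `meets_requirement` is hypothetical:

```python
import sys

# Minimum version stated for this course
REQUIRED = (3, 7)


def meets_requirement(version_info=sys.version_info, required=REQUIRED):
    """Return True if the (major, minor) version is at least `required`."""
    return tuple(version_info[:2]) >= required


print(sys.version)
print("OK" if meets_requirement() else "Please install a newer Python")
```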

# Shall we begin?

![Talk is cheap](img/quote-talk-is-cheap-show-me-the-code-linus-torvalds-273528.jpg)