![](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRNNMSrl6xvbfPjLuk9k3dl7VyNi0ky19x11A&usqp=CAU)

# **Introduction to Python for Data Science - Part 1 : First Steps with Python**
Ericsson - August 4 to August 13, 2020

by Sarah Legendre Bilodeau, M.Sc., HEC Montréal

Authors :
Thomas Vaudescal,
Sarah Legendre Bilodeau,
Laurent Barcelo


# In This Training

In this training, you will learn the **basics** of programming with the **Python** language and learn how to use tools to **prepare**, **manipulate** and **transform** your data for **visualization** and **analysis**. 

This Python training will cover several important aspects such as data manipulation, data cleansing, visualization and the use of some packages that are frequently used in data science today. 



# What is the Python Language?

Python first appeared in 1991. It was designed to be a **readable** and **simple** language: Python is **easy to learn** and **easy to use**. It offers a wide range of possibilities and makes it possible to create programs quickly and with little effort. Being cross-platform, open source and versatile, Python is a great language for data science and is widely recognized in the development community. 


Today, emerging fields such as **Data Science**, **Artificial Intelligence** and also **Machine Learning** play an important role in problem solving. Python has become a reference language for implementing solutions in these fields. 



# The Choice of the IDE (Integrated Development Environment)
  
Although Python can be used from the Python shell for simple needs, an appropriate IDE is strongly recommended in Data Science. An IDE provides a more user-friendly interface and several tools that make programming easier and more enjoyable.

In this training, we will use Jupyter Notebook on Google Colab. 

It's very simple : all you need is a Google account!

## What is Jupyter Notebook?

**Jupyter Notebook** is a client-server application created by the non-profit organization **Project Jupyter**. It was published in 2015. It allows the creation and sharing of documents stored in a JSON-based format and made up of an ordered list of **input cells** and **output cells**. The cells can contain, among other things, **code**, text in **Markdown format** and **mathematical formulas**. Processing is done with a client application running on the **Web**, which is accessed through the **usual browsers**. The Jupyter Notebook is easy to use, and tutorials are available on the Jupyter website. 

The advantages of using a **Jupyter Notebook** are as follows: 
- Mature and free solution
- Visualization of Pandas Dataframes
- Interactive and executable code sharing

There are several ways to access a **Jupyter Notebook** : 
- Directly on the website : [Jupyter](https://jupyter.org/)
- By [Anaconda](https://www.anaconda.com/)
- On Google Colab


# Base Python Programming and Packages

The Python language offers many basic capabilities for programming. However, the main advantage of using Python in Data Science is the ability to install specialized packages of functions for data manipulation.

Thus, during this training, we will start with some basic content about the base Python language and then learn how to use some powerful packages that are specialized in Data Science. For example :

- Pandas
- Numpy
- Matplotlib
- Seaborn
- SciPy
- Scikit-learn
- statsmodels
- ...

# Python Language Basics

Indentation : Python uses whitespace (tabs or spaces) to structure the code. The lines that belong to the same block must be indented by the same amount.

Object : Everything is an object (number, string, data structure, function, class, module, ...)

Comments : Any text preceded by the hash mark (#) is ignored by the Python interpreter. That symbol is often used to add comments in the code (or to exclude some parts of the code)
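
The short sketch below illustrates these three ideas together. The variable names (`temperatures`, `warm_days`) and the sample values are made up for illustration; loops and conditions themselves are only used here to show how indentation groups lines.

```python
# A comment: everything after the hash mark is ignored by the interpreter.
temperatures = [18, 25, 31]   # hypothetical sample data

warm_days = 0
for t in temperatures:
    # The indented lines below belong to the for loop.
    if t > 20:
        # This line is indented once more, so it belongs to the if block.
        warm_days += 1

# Back at the left margin: this line runs once, after the loop has ended.
print(warm_days)

# Everything is an object, so numbers and even functions have a type.
print(type(warm_days))
print(type(print))
```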





# Functions

A function is a block of code which only runs when it is called. It allows you to perform tasks on the data.

You can pass data, known as parameters, into a function. Parameters must be specified within parentheses :

`my_function(param1, param2, ...)`

A function can return data as a result.
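
As a minimal sketch, here is how a function could be defined and then called. The function name `rectangle_area` and its parameters are hypothetical and chosen only for this example.

```python
def rectangle_area(width, height):
    """Return the area of a rectangle from its two sides."""
    return width * height


# Calling the function: the values 3 and 4 are passed as parameters,
# and the returned result is stored in the variable "area".
area = rectangle_area(3, 4)
print(area)  # 12
```

The code inside the function does not run when it is defined, only when `rectangle_area(3, 4)` is called.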




# Methods

Almost every object in Python has attached functions, known as methods, that have access to the object's internal contents. 

`obj.some_method(param1, param2, param3)`
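
For illustration, the sketch below calls a few built-in methods on a string and on a list. The variable names and values are made up for this example.

```python
city = "montreal"

# String methods return a new value and leave the original unchanged.
shout = city.upper()
starts_with_mon = city.startswith("mon")
print(shout)            # MONTREAL
print(starts_with_mon)  # True

numbers = [3, 1, 2]

# List methods such as sort() and append() modify the list itself.
numbers.sort()
numbers.append(4)
print(numbers)          # [1, 2, 3, 4]
```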

# Variables

**Without variables, it is impossible to write a program**. A variable assigns a name, or **identifier**, to a piece of **information**: once it is named, we can **manipulate** this information much more easily. 

The other advantage is to be able to write **valid programs** for **varying values**: we can change the value of the variables and the program will always run in the same way and perform the same types of calculations, whatever the values. Variables play a **role similar** to that of the **unknowns in a mathematical equation**.
   
To summarize, a variable is characterized by :

 - An **identifier** (a name): it can contain letters, numbers and underscores, but it cannot begin with a number. **Upper and lower case letters are differentiated**. 

 - A **type** : a piece of information about the content of the variable which tells the Python interpreter how to manipulate it.
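
The naming rules can be checked with a small sketch. The variable names are hypothetical; `str.isidentifier()` is a built-in string method that reports whether a text would be accepted as a name.

```python
# Upper and lower case letters are differentiated:
# "total" and "Total" are two separate variables.
total = 10
Total = 20
print(total, Total)  # 10 20

# Letters, numbers and underscores are allowed...
valid_name = "my_var1".isidentifier()
# ...but a name cannot begin with a number.
invalid_name = "1my_var".isidentifier()

print(valid_name)    # True
print(invalid_name)  # False
```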

## Assigning Values to Variables

Creating a variable reserves memory space. Unlike some other languages, Python does not need an explicit declaration to reserve that space. The equal sign (=) is used to create variables and assign values to them. 

In [1]:
# Here we create a "test" variable that contains the string "Hello".
# Using the type() function, we can find out the type corresponding to the variable. 
test = "Hello" 
type(test) # type function

str

In [2]:
# Calculation example using variables 
n = 5
k = 2

Product = k*n

print(type(Product)) #print and type functions
print(Product)

<class 'int'>
10


## Standard Data Types

Python has various standard data types that are used to define the possible operations with them and the storage method for each of them.

Python has five standard data types :

- Numbers
- String
- List
- Tuple
- Dictionary



### Numbers

A number object is created when you assign a **numerical value** to a variable. 

Python 3 supports three different numerical types :

- *int* (signed integers of unlimited size; they can also be written in octal and hexadecimal)
- *float* (floating point real values)
- *complex* (complex numbers)

The separate *long* type of Python 2 no longer exists: in Python 3, it was merged into *int*.

In [3]:
# Numbers
num1 = 23
num2 = 2.3

print(type(num1))
print(type(num2))

<class 'int'>
<class 'float'>
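
The sketch below shows a few more properties of numbers in Python 3. The variable names are made up for illustration.

```python
# Integers have no fixed size limit in Python 3.
big = 2 ** 100
print(big)        # 1267650600228229401496703205376
print(type(big))  # <class 'int'>

# Integers can also be written in hexadecimal (0x) or octal (0o).
hex_value = 0xFF
octal_value = 0o17
print(hex_value, octal_value)  # 255 15

# Complex numbers use the suffix j for the imaginary part.
z = 3 + 4j
print(type(z))  # <class 'complex'>
print(abs(z))   # 5.0

# "/" always returns a float, "//" is the integer (floor) division.
print(7 / 2)    # 3.5
print(7 // 2)   # 3
```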


### Strings

A string is a finite sequence of characters. In other words: **text**.

Strings are identified as a contiguous set of characters represented between quotation marks. Pairs of single or double quotes are allowed. Subsets of strings can be taken using the slice operator ([ ] and [:]) **with indexes starting at 0** at the beginning of the string. In a slice `[start:end]`, the character at position `end` is not included. Do not confuse the part between quotes or apostrophes, which is a constant, with the variable that contains it.

The plus (+) sign is the string concatenation operator and the asterisk (*) is the repetition operator.

In [6]:
# Strings

str1 = "My first string! 1234"

print(str1)         # Complete string
print(str1[0])       # First character of the string
print(str1[3:8])     # Characters from the 4th to the 8th (index 8 is excluded)
print(str1[3:])      # String starting from 4th character
print(str1 * 2)      # String two times
print(str1 + "TEST") # Concatenated string
print(type(str1))

str2 = "It's a good test for you"
print(str2)

str3 = 'It s a good test for you'

My first string! 1234
M
first
first string! 1234
My first string! 1234My first string! 1234
My first string! 1234TEST
<class 'str'>
It's a good test for you


Use the ``format`` method, the ``%`` operator or an f-string to format strings, even when the parameters are all strings. Use your judgment to decide between + and the formatting tools. A few examples:

In [7]:
a = 3
b = 4

x = a + b
print(x)

x1 = '%s, %s!' % (a, b)
print(x1)

x2 = '{}, {}'.format(a, b)
print(x2)

x3 = 'a: %s; b: %s' % (a, b)
print(x3)

x4 = 'a: {}; b: {}'.format(a, b)
print(x4)

x5 = f'a: {a}; b: {b}'
print(x5)

#print("a:" + a + " b:" + b) # can only concatenate str (not "int") to str
print("a:" + str(a) + " b:" + str(b))

7
3, 4!
3, 4
a: 3; b: 4
a: 3; b: 4
a: 3; b: 4
a:3 b:4
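
Formatting tools can also control how a value is displayed, which is hard to do with `+` alone. The following sketch uses made-up values (`price`, `ratio`) to show a few common format specifications.

```python
price = 1234.5678
ratio = 0.256

two_decimals = f"{price:.2f}"       # round to 2 decimal places
with_thousands = f"{price:,.2f}"    # add a thousands separator
as_percent = f"{ratio:.1%}"         # display as a percentage
padded = "{:>8}".format("abc")      # right-align in 8 characters

print(two_decimals)    # 1234.57
print(with_thousands)  # 1,234.57
print(as_percent)      # 25.6%
print(repr(padded))    # '     abc'
```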


### Lists

Lists are the most versatile of Python's compound data types. A list contains items separated by commas and enclosed within **square brackets** ([]).
The items in a list can be of different data types.

The values stored in a list can be accessed using the slice operator ([ ] and [:]) **with indexes starting at 0** in the beginning of the list.



In [8]:
# Lists

list1 = [ 'abcd', 123 , 10.2, 'EXAMPLE', 70.2 ]
list2 = [456, 'OTHER']

print(list1)          # Prints complete list
print(list2)

print(list1[0])       # First element of the list
print(list1[1:3])     # Elements starting from 2nd till 3rd 
print(list1[2:])      # Elements starting from 3rd element
print(list2 * 2)  # List2 two times
print(list1 + list2) # Concatenated lists

print(type(list1))
print(type(list1[1]))

print(list1[0][1])

['abcd', 123, 10.2, 'EXAMPLE', 70.2]
[456, 'OTHER']
abcd
[123, 10.2]
[10.2, 'EXAMPLE', 70.2]
[456, 'OTHER', 456, 'OTHER']
['abcd', 123, 10.2, 'EXAMPLE', 70.2, 456, 'OTHER']
<class 'list'>
<class 'int'>
b


### Tuples

The tuple is similar to the list, because it consists of a number of values separated by commas.

The main differences between lists and tuples are that the lists are enclosed in brackets ( [ ] ) and their elements and size can be changed, while tuples are enclosed in parentheses ( ( ) ) and cannot be updated. 

In [9]:
tuple1 = ( 'abcd', 123 , 10.2, 'EXAMPLE', 70.2 )
tuple2 = (456, 'OTHER')

print(tuple1)          # Prints complete tuple
print(tuple2)

print(tuple1[0])       # First element of the tuple
print(tuple1[1:3])     # Elements starting from 2nd till 3rd 
print(tuple1[2:])      # Elements starting from 3rd element
print(tuple2 * 2)  # tuple two times
print(tuple1 + tuple2) # Concatenated tuple

print(type(tuple1))
print(type(tuple1[1]))

print(tuple1[0][1])

('abcd', 123, 10.2, 'EXAMPLE', 70.2)
(456, 'OTHER')
abcd
(123, 10.2)
(10.2, 'EXAMPLE', 70.2)
(456, 'OTHER', 456, 'OTHER')
('abcd', 123, 10.2, 'EXAMPLE', 70.2, 456, 'OTHER')
<class 'tuple'>
<class 'int'>
b


In [10]:
# Difference between list and tuple
list1 = ['abcd', 123 , 10.2, 'EXAMPLE', 70.2]
list1[2] = 999     # Valid syntax with list
print(list1)

tuple1 = ('abcd', 123 , 10.2, 'EXAMPLE', 70.2)
#tuple1[2] = 999    # TypeError: tuples do not support item assignment


['abcd', 123, 999, 'EXAMPLE', 70.2]
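
To see what actually happens when a tuple is modified, the sketch below catches the error instead of commenting the line out. The names `point` and `message` are hypothetical, and the `try` / `except` construct is used here only to display the error without stopping the program.

```python
point = (1, 2, 3)

try:
    point[0] = 99          # not allowed: a tuple cannot be updated
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# The tuple is unchanged.
print(point)  # (1, 2, 3)
```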


### Dictionary

A dictionary is a collection of **key-value pairs** which is changeable and indexed by key. Since Python 3.7, a dictionary also keeps its items in insertion order. A dictionary key can be any immutable (hashable) Python object, but keys are usually numbers or strings.

Dictionaries are enclosed by curly brackets ({ }) and values can be assigned and accessed using square brackets ([]).

In [11]:
dict1 = {}
dict1['groupe'] = "A"
dict1['school'] = "HEC"
dict1['result'] = 90

print(dict1)          # Prints complete dictionary
print(dict1.keys())   # Prints all the keys
print(dict1.values()) # Prints all the values

print(dict1['groupe'])
print(dict1['result'])

dict2 = {'groupe':'B', 'school':"UL", 'result':88}
print(dict2.values()) 

{'groupe': 'A', 'school': 'HEC', 'result': 90}
dict_keys(['groupe', 'school', 'result'])
dict_values(['A', 'HEC', 90])
A
90
dict_values(['B', 'UL', 88])
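
A few other common dictionary operations are sketched below. The dictionary `student` and its contents are made up for illustration, reusing the same keys as above.

```python
student = {"groupe": "A", "school": "HEC", "result": 90}

student["result"] = 95        # update the value of an existing key
student["city"] = "Montréal"  # add a new key-value pair
del student["groupe"]         # remove a key and its value

has_school = "school" in student      # test whether a key exists
age = student.get("age", "unknown")   # default value if the key is missing

print(has_school)  # True
print(age)         # unknown

# Loop over the key-value pairs, in insertion order.
for key, value in student.items():
    print(key, value)
```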


## Variables and Argument Passing

When you assign a variable (or name) in Python, you create a **reference** to the object on the right side of the equal sign. For example, let's take a list of integers : 

In [9]:
a = [1,2,3]

The variable `a` is now assigned to a new variable `b` :

In [10]:
b = a 

In some languages, this assignment would result in the `[1, 2, 3]` data being copied. In Python, **`a` and `b` actually refer to the same object**, the original `[1, 2, 3]` list. You can prove this to yourself by adding an item to `a` and then looking at `b` :

In [11]:
a.append(4)
print(b)
# A change to "a" has resulted in a change to "b"!

[1, 2, 3, 4]


**Understanding** the semantics of Python references and knowing **when, how and why** data is copied is especially important when working with **large** datasets. 

To get an independent copy instead of a second reference, it is possible to write :

- `b = list(a)`, using Python's built-in function `list()`. 
- `b = [i for i in a]`, using a list comprehension. We will see later in the training how these work. 
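
The sketch below checks that a copy made with `list()` is independent of the original. It also shows one caveat: such a copy is "shallow", so lists nested inside the list are still shared. The `copy.deepcopy` function from the standard library is one way to copy nested data as well; the variable names are made up for this example.

```python
import copy

a = [1, 2, 3]
b = list(a)      # b is a new list with the same items
a.append(4)

print(a)         # [1, 2, 3, 4]
print(b)         # [1, 2, 3]
print(a is b)    # False: two different objects

# Caveat: with nested lists, a shallow copy still shares the inner lists.
nested = [[1, 2], [3, 4]]
shallow = list(nested)
deep = copy.deepcopy(nested)
nested[0].append(99)

print(shallow[0])  # [1, 2, 99]
print(deep[0])     # [1, 2]
```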

## Immutable Objects ##

An **immutable** object **cannot be modified** after it has been created. Any operation that appears to change it actually **creates another object of the same type**, even if that object is only temporary.

**Strings** and **tuples** are immutable objects. To change one, you must build a new object of the same type that includes the modification.

In [12]:
# The statement x += 10 adds 10 to variable x which is initially at 1. 
# This is equivalent to writing x = x + 10. 
# Also works with x -= 10 to subtract; x *= to multiply etc...
# Variable must be initialized first. 
x = 1
x += 10
print(x)

11


In [13]:
# Example of a Tuple : 
# Remember that a tuple is in parentheses and that you can't modify its contents. 
is_Tuple = ("Orange", "Jaune", "Blue")
print(is_Tuple)

#is_Tuple[1]="Yellow"  #Error!

# Solution
is_tuple2 = list(is_Tuple)
is_tuple2[1]="Yellow"
is_Tuple=tuple(is_tuple2)
print(is_Tuple)

str1 = "abcxefg"
print(str1)
#str1[3]="d" #Error!

# Solution
str2 = str1[:3]+"d"+str1[4:]
print(str2)

('Orange', 'Jaune', 'Blue')
('Orange', 'Yellow', 'Blue')
abcxefg
abcdefg
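
One way to observe the difference between mutable and immutable objects is to keep a second name on the same object and apply `+=`. This is only an illustrative sketch with made-up variable names; the `is` operator tests whether two names refer to the same object.

```python
# Tuple (immutable): += builds a new tuple, the original is untouched.
colors = ("Orange", "Yellow")
colors_alias = colors
colors += ("Blue",)

print(colors)                  # ('Orange', 'Yellow', 'Blue')
print(colors_alias)            # ('Orange', 'Yellow')
print(colors is colors_alias)  # False

# List (mutable): += changes the existing list in place.
sizes = [1, 2]
sizes_alias = sizes
sizes += [3]

print(sizes_alias)             # [1, 2, 3]
print(sizes is sizes_alias)    # True
```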


# Import Packages
A **package** is a **set of functions and objects**. These are grouped together and made available so that they can be used without having to rewrite them.

These functions and objects allow you to do: numerical calculation, graphics, text formatting, document generation...

Some Python packages have become famous among the majority of Python users because of the efficiency of their modules. We can mention **NumPy, Matplotlib, Seaborn**, or **Pandas**, all useful for data science. It is possible to learn about the functions of a package by going directly to their **documentation site**. For example for pandas, you just have to type "pandas python" in the search bar. 

There are several ways to import a package: 

- **import "package name"**: The import statement is used when you want to import an entire package. Its contents are then accessed through the package name (for example `pandas.DataFrame`). This is the most common method: it requires only one line of code and makes it clear where each function comes from.

In [17]:
# Importing the pandas package
import pandas
# Importing more than one package with a single import statement
import pandas, numpy

- **from "package name" import "function name"** : This command is used when you want to import specific names from the package so that they can be used directly, without the package prefix. Python still loads the whole package in the background, so this form does not save memory; the difference lies in which names become available in your program.

In [18]:
from pandas import DataFrame

It is possible to use aliases by adding `as` after the import. **Aliases make it easier to call the different modules** in our program. The most commonly used are the following:

In [19]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
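
The three import styles can be compared side by side. To keep this sketch runnable without installing anything, it uses the `math` and `statistics` modules from the Python standard library instead of the data science packages; the alias `stats` is an arbitrary choice for this example.

```python
import math                  # whole module, used with the "math." prefix
from math import sqrt        # a single name, used without prefix
import statistics as stats   # whole module, used through an alias

print(math.pi)                     # 3.141592653589793
print(sqrt(16))                    # 4.0
print(stats.mean([1, 2, 3, 4]))    # 2.5

# Both names refer to the same function object.
print(sqrt is math.sqrt)           # True
```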

# Define the Current Directory 

Setting the current directory is useful for specifying in which directory Python should look for importing data files. To check the current directory, execute the following command :

In [14]:
# Importing the os package
import os
os.getcwd()

'c:\\Git\\GitHub\\LearnAIML\\Python\\PythonTraining\\Part1-Basics'

**For IDEs installed on the computer** :

To change the current directory, just specify the full path of the new directory, as below :

In [21]:
import os
os.chdir("C:\\Users\\sarah\\Documents\\HEC\\École des dirigeants")

**For Google Colab IDE** :

In [5]:
from google.colab import drive
drive.mount("/content/drive")

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [9]:
import os
os.chdir("/content/drive/My Drive/Colab Notebooks/Ericsson")

The current directory is where Python looks for a file when you give only a file name or a relative path instead of a full path.
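
This behaviour can be checked with a small sketch. It assumes nothing about your own folders: it creates a temporary directory with the standard `tempfile` module, moves into it, and writes a file using only a file name. The file name `example.txt` is made up for illustration.

```python
import os
import tempfile

start_dir = os.getcwd()   # remember where we started

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)         # the temporary folder becomes the current directory

    # No path is given, so the file is created in the current directory.
    with open("example.txt", "w") as f:
        f.write("hello")

    found = os.path.exists(os.path.join(tmp, "example.txt"))
    print(found)          # True

    os.chdir(start_dir)   # go back before the temporary folder is deleted
```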