## AI2E - 1 - Data Exploration (Part 1) 
This is the first notebook in a series of machine learning workshops. To start our journey in the best way possible, we will get to know the tools we're going to work with: Anaconda and Jupyter Notebook. We will conclude this workshop by going through a small Algerian dataset and getting to know useful Python packages that will help us explore and study datasets.

### Content 
1. Python Cheat Sheet
2. What is Anaconda ?
3. What is Jupyter Notebook ? How can I use it ?
4. **Welcome to NumPy**
5. **Data exploration (Part 1)**

Machine learning involves a lot of matrix math, and it's important for you to understand the basics before diving into building your own first model.

This notebook provides a short refresher on matrix operations and on how to work with data using two fundamental packages : NumPy and Pandas.

[NumPy](https://numpy.org/) is an open-source Python library used in nearly every area of science and engineering.
* It provides ndarray, a homogeneous n-dimensional array object, with methods to efficiently operate on it. 
* It is at the heart of many commonly used packages : Pandas, SciPy, Matplotlib, scikit-learn, scikit-image. 

In [12]:
# If the imports below fail because numpy or pandas is missing, install them first (oh yes, we can even run shell commands from a notebook :p)
!pip install numpy
!pip install pandas 



In [13]:
# "as" is used to give an alias (rather than use numpy.array() we will use np.array()) 
import numpy as np
import pandas as pd

  NumPy arrays can hold data of different dimensionalities :
  * **Scalars** : a single value.
  * **Vectors** :
      * A 1D array of values.
      * Can be either vertical (column) or horizontal (row).
      * Indexing starts at 0.
  * **Matrices** : 2D arrays.
  * **Tensors**  : arrays with three or more dimensions (for example, a stack of 2D arrays along a depth axis).

In [14]:
a = np.array(3)                                                       # Scalar 
b = np.array([1,2,3, 4, 5, 6, 7, 8, 9])                               # Vector
c = np.array([[1,2,3], [4,5,6], [7,8,9]])                             # Matrix
d = np.array([[[[1],[2]],[[3],[4]],[[5],[6]]],[[[7],[8]],\
    [[9],[10]],[[11],[12]]],[[[13],[14]],[[15],[16]],[[17],[18]]]])   # Tensor 

In [27]:
print('Scalar', a.ndim)
print('Vector',b.ndim)
print('Matrix',c.ndim)
print('Tensor',d.ndim)

Scalar 0
Vector 1
Matrix 2
Tensor 4


In [28]:
print('Scalar', a.size)
print('Vector',b.size)
print('Matrix',c.size)
print('Tensor',d.size)

Scalar 1
Vector 9
Matrix 9
Tensor 18


In [29]:
print('Scalar', a.shape)
print('Vector',b.shape)
print('Matrix',c.shape)
print('Tensor',d.shape)

Scalar ()
Vector (9,)
Matrix (3, 3)
Tensor (3, 3, 2, 1)


In [15]:
a

array(3)

In [17]:
a+2

5

In [18]:
b.shape

(9,)

In [19]:
b[1]

2

#### Slicing ! 

In [20]:
# Get all the values of b from index 2 to 4 (included)
b[2:5]

array([3, 4, 5])
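A slice `start:stop` includes `start` but excludes `stop`, and an optional third value sets the step. Here is a minimal sketch of both rules; the array `seq` is a hypothetical example and is not part of the workshop data.

```python
import numpy as np

# Hypothetical array, used only for this illustration
seq = np.arange(10, 20)      # values 10, 11, ..., 19

middle = seq[2:5]            # indexes 2, 3 and 4 (index 5 is excluded)
every_other = seq[1:8:2]     # start at 1, stop before 8, step of 2

print(middle)        # [12 13 14]
print(every_other)   # [11 13 15 17]
```

A handy consequence of the excluded stop index is that `seq[start:stop]` always contains `stop - start` elements (as long as both indexes are inside the array).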

In [None]:
# Question : Get all the values from index 3 till the end 

In [21]:
# Last value in an array 
b[-1]

9

In [22]:
# Delete the 5th element (index 4); np.delete returns a new array and leaves b unchanged
np.delete(b, 4)

array([1, 2, 3, 4, 6, 7, 8, 9])

In [24]:
# Append an element; np.append also returns a new array and leaves b unchanged
np.append(b,10)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
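Notice that the appended result still contains the value 5 : `np.delete` and `np.append` both return a new array and never modify the original. The sketch below illustrates this with a hypothetical array `values` (an assumption for this example, so that `b` stays untouched).

```python
import numpy as np

# Hypothetical array, used only for this illustration
values = np.array([10, 20, 30, 40, 50])

without_last = np.delete(values, 4)   # new array without the element at index 4
with_extra = np.append(values, 60)    # new array with 60 added at the end

print(values)         # [10 20 30 40 50]  (unchanged)
print(without_last)   # [10 20 30 40]
print(with_extra)     # [10 20 30 40 50 60]

# To keep a change, assign the result back to a variable
values = np.append(values, 60)
print(values.size)    # 6
```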

In [None]:
# Question : Get all values of the second column of c that are greater than 5 

In [30]:
# Get all  values of c that are > 5 
c[c>5]

array([6, 7, 8, 9])

In [40]:
# A comparison returns a boolean mask (masks can be combined with the logical operators & and |)
c>5

array([[False, False, False],
       [False, False,  True],
       [ True,  True,  True]])

In [42]:
indexes = np.where(c>5)
indexes

(array([1, 2, 2, 2]), array([2, 0, 1, 2]))
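Boolean masks can be combined element by element with `&` (and) and `|` (or), and `np.where` returns one array of indexes per dimension. The sketch below shows both ideas on a hypothetical matrix `grid`; the conditions are chosen only for illustration.

```python
import numpy as np

# Hypothetical matrix, used only for this illustration
grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# Each comparison must be wrapped in parentheses before combining
even_and_large = grid[(grid % 2 == 0) & (grid > 4)]   # even AND greater than 4
extremes = grid[(grid < 2) | (grid > 8)]              # smaller than 2 OR greater than 8

print(even_and_large)   # [6 8]
print(extremes)         # [1 9]

# np.where gives the row indexes and the column indexes separately
rows, cols = np.where(grid > 7)
positions = list(zip(rows.tolist(), cols.tolist()))
print(positions)        # [(2, 1), (2, 2)]
```

The parentheses matter because `&` and `|` have a higher precedence than comparison operators in Python.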

In [31]:
# Question : Get All values of d that are between 4 and 8 

In [32]:
# Reshape vector b so that it has the same shape as matrix c 
b.reshape(3,3)

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [33]:
b

array([1, 2, 3, 4, 5, 6, 7, 8, 9])
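As the output above shows, `reshape` returns a reshaped array and leaves the shape of `b` unchanged. As a small sketch (with a hypothetical array `flat`), one dimension can also be given as `-1`, in which case NumPy infers it from the number of elements.

```python
import numpy as np

# Hypothetical array, used only for this illustration
flat = np.arange(1, 13)        # 12 values : 1, 2, ..., 12

# -1 asks NumPy to infer the missing dimension (12 / 3 = 4 columns)
table = flat.reshape(3, -1)

print(table.shape)   # (3, 4)
print(flat.shape)    # (12,)  (the original keeps its shape)
```

The total number of elements must stay the same, so a call such as `flat.reshape(5, -1)` would raise a `ValueError` here.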

**Create some special arrays**

In [34]:
e = np.zeros(2)
e

array([0., 0.])

In [35]:
f = np.ones(2 , dtype=int)
g = np.ones(2 , dtype=float)
print(f)
print(g)

[1 1]
[1. 1.]


In [36]:
b = np.arange(10)
b

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [39]:
h = np.linspace(0,10,5)
h

array([ 0. ,  2.5,  5. ,  7.5, 10. ])

**Array Operations**

In [47]:
b

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [48]:
a = np.arange(10,20)

In [50]:
a+b

array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

In [51]:
a-b

array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])

In [52]:
b.sum()

45

In [53]:
c.sum(axis=0)

array([12, 15, 18])

**Broadcasting** 

In [55]:
b * 10

array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

In [56]:
a = np.array([1,2,3])
c + a 

array([[ 2,  4,  6],
       [ 5,  7,  9],
       [ 8, 10, 12]])
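Broadcasting lets NumPy combine arrays of different shapes by virtually repeating the smaller one. A 1D array of length 3 is applied to every row of a (3, 3) matrix, while a (3, 1) column is applied to every column. The sketch below uses hypothetical arrays `grid`, `row` and `col` to show the difference.

```python
import numpy as np

# Hypothetical arrays, used only for this illustration
grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])
row = np.array([10, 20, 30])          # shape (3,)
col = np.array([[100], [200], [300]]) # shape (3, 1)

by_row = grid + row   # row is added to each row of grid
by_col = grid + col   # col is added to each column of grid

print(by_row)
print(by_col)
```

With these values, `by_row` starts with `[11, 22, 33]` and `by_col` starts with `[101, 102, 103]`.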

NumPy also performs aggregation functions. In addition to min, max, and sum, you can easily run mean to get the average, prod to get the result of multiplying the elements together, std to get the standard deviation, and more.

You will notice in the next session that these statistics are really important to study our datasets in the best way possible. 

In [57]:
c.min()

1

In [58]:
c.min(axis=1)

array([1, 4, 7])

In [60]:
c.mean()

5.0
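The other aggregation functions mentioned above work the same way, and most of them accept an `axis` argument. Here is a short sketch on a hypothetical 2x2 matrix `small` (`axis=0` aggregates down the columns, `axis=1` across the rows).

```python
import numpy as np

# Hypothetical matrix, used only for this illustration
small = np.array([[1, 2],
                  [3, 4]])

print(small.max())          # 4
print(small.prod())         # 24  (1 * 2 * 3 * 4)
print(small.mean())         # 2.5
print(small.std())          # about 1.118 (population standard deviation)

print(small.mean(axis=0))   # [2. 3.]    mean of each column
print(small.mean(axis=1))   # [1.5 3.5]  mean of each row
```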

This is a general overview of what we can do with NumPy. We encourage you to keep testing these methods and to make sure you understand slicing and broadcasting, as these concepts are really important for the rest of our sessions. 

You can also check these resources to learn more about NumPy : 
* The ultimate beginners guide to Numpy : https://towardsdatascience.com/the-ultimate-beginners-guide-to-numpy-f5a2f99aef54 
* Data Analysis with Numpy : https://www.dataquest.io/blog/numpy-tutorial-python/

## Let's play with Data ! 🎉🚀

The dataset that we're going to work on is provided by an Algerian VTC (ride-hailing) company. 
It includes variables about : 
* Address of departure
* Address of arrival
* Distance
* Price 
* Date 
* ... 

We will explore the dataset using pandas. 

[Pandas](https://pandas.pydata.org/) is an open-source Python library providing high-performance, easy-to-use data structures and data analysis tools. (If you're familiar with SQL, you can think of pandas methods as SQL queries.) 

We have two important components in pandas : 
* Series : a one-dimensional labeled array; you can think of it as a single column.
* DataFrame : a two-dimensional table made up of a collection of Series.
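To make the relationship between the two concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The table `trips` and its values are hypothetical and only mimic two columns of the kind of data we are about to load.

```python
import pandas as pd

# Hypothetical toy table, used only for this illustration
trips = pd.DataFrame({
    "travel_type": ["live", "advance", "live"],
    "distance": [3056.0, 6053.0, 6092.0],
})

# Selecting one column of a DataFrame gives a Series
distances = trips["distance"]

print(type(trips).__name__)       # DataFrame
print(type(distances).__name__)   # Series
print(trips.shape)                # (3, 2) : 3 rows, 2 columns
print(distances.mean())           # average of the column
```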


In [61]:
# read a csv file
data = pd.read_csv("dataset/vtc_data.csv")

In [62]:
#print the first 5 rows 
data.head()

Unnamed: 0,travel_id,travel_type,car_type,driver_id,address_of_departure,arrival_address,lat_and_long_of_arrival_address,date_of_travel,time_of_travel,estimated_time,distance,options,state,price
0,27550488,advance,Mini Citadine,,"AXA Assurances Algérie, Boulevard du 11 Decemb...","Clinique Krim Belkacem, Boulevard Colonel Krim...","36.770681600000003,3.0510609999999998",2019-09-29 23:55:00,23:55:00,13.0,6053.0,,cancelled,0.0
1,28026204,live,Mini Citadine,43166.0,Unnamed Road Dar El Beïda,"136 logements, bloc D، Route de Ouled Fayet Ch...","36.745614799999998,2.9428641",2019-09-29 23:54:36,23:54:00,33.0,31895.0,,finished,133300.0
2,28026119,live,Mini Citadine,30759.0,"Office Riadh El Feth ,bois des arcades ,El Mad...","79 Rue Fabri MARCELLO, Bir Mourad Raïs, 阿尔及利亚","36.736257999999999,3.0616660000000002",2019-09-29 23:54:27,23:54:00,7.0,3056.0,,finished,32600.0
3,28025871,live,Mini Citadine,32536.0,N24 Bordj El Kiffan,"Dar El Beïda, Algeria","36.727856899999999,3.2181302000000001",2019-09-29 23:47:25,23:47:00,7.0,6092.0,,finished,38300.0
4,28026102,live,Mini Citadine,40592.0,"Lily Rose, Hydra, Algeria","214 résidence، Bois des Cars 2, Deli Ibrahim, ...","36.753667999999998,2.9762750000000002",2019-09-29 23:46:35,23:46:00,11.0,6928.0,,finished,47800.0


In [63]:
#print the last 5 rows
data.tail()

Unnamed: 0,travel_id,travel_type,car_type,driver_id,address_of_departure,arrival_address,lat_and_long_of_arrival_address,date_of_travel,time_of_travel,estimated_time,distance,options,state,price
65330,26076425,live,Mini Citadine,42286.0,"Rue Raoul PAYEN, Hydra, Algérie","Bordj El Bahri, Algérie","36.785442800000013,3.2643347999999999",2019-08-26 18:37:02,18:37:00,43.0,32420.0,,finished,119400.0
65331,26076320,live,Mini Citadine,39729.0,"Egeco, Algiers, Algérie",Chemin wilaya Nord 42 Kouba,"36.718478645545801,3.09075839817524",2019-08-26 18:36:57,18:36:00,6.0,2444.0,,operational_cancelled,0.0
65332,26076305,live,Mini Citadine,41170.0,Rue Ouahrani Abdelkader Oued Romane El Achour,Route AIN ALLAH Ain Allah Dely Ibrahim,"36.753617400324998,2.9945344850420952",2019-08-26 18:36:56,18:36:00,8.0,2582.0,,finished,27100.0
65333,26076433,live,Mini Citadine,34783.0,"26 Boulevard Colonel Bougara, Alger Ctre 16000...","Bab Ezzouar, Algérie","36.712336899999997,3.1968659000000001",2019-08-26 18:36:37,18:36:00,21.0,18408.0,,finished,69600.0
65334,26076033,live,Mini Citadine,39757.0,"N1, Bir Mourad Raïs, Algérie",Hydra، Algérie,"36.746026299999997,3.038882000000001",2019-08-26 18:35:57,18:35:00,7.0,2030.0,,finished,24700.0


In [65]:
data.shape

(65335, 14)

In [66]:
rows_number = data.shape[0]
column_number = data.shape[1]

In [76]:
# get all the column names 
data.columns

Index(['travel_id', 'travel_type', 'car_type', 'driver_id',
       'address_of_departure', 'arrival_address',
       'lat_and_long_of_arrival_address', 'date_of_travel', 'time_of_travel',
       'estimated_time', 'distance', 'options', 'state', 'price'],
      dtype='object')

In [69]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65335 entries, 0 to 65334
Data columns (total 14 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   travel_id                        65335 non-null  int64  
 1   travel_type                      65335 non-null  object 
 2   car_type                         65335 non-null  object 
 3   driver_id                        60229 non-null  float64
 4   address_of_departure             65335 non-null  object 
 5   arrival_address                  65292 non-null  object 
 6   lat_and_long_of_arrival_address  65292 non-null  object 
 7   date_of_travel                   65335 non-null  object 
 8   time_of_travel                   65121 non-null  object 
 9   estimated_time                   65335 non-null  float64
 10  distance                         65078 non-null  float64
 11  options                          1152 non-null   object 
 12  state             

In [71]:
# Locate one travel by its index 
data.loc[1]

travel_id                                                                   28026204
travel_type                                                                     live
car_type                                                               Mini Citadine
driver_id                                                                      43166
address_of_departure                                        Unnamed Road Dar El Beïda
arrival_address                     136 logements, bloc D، Route de Ouled Fayet Ch...
lat_and_long_of_arrival_address                         36.745614799999998,2.9428641
date_of_travel                                                   2019-09-29 23:54:36
time_of_travel                                                              23:54:00
estimated_time                                                                    33
distance                                                                       31895
options                                                        

In [74]:
data.loc[data["travel_id"] == 28026204 ]

Unnamed: 0,travel_id,travel_type,car_type,driver_id,address_of_departure,arrival_address,lat_and_long_of_arrival_address,date_of_travel,time_of_travel,estimated_time,distance,options,state,price
1,28026204,live,Mini Citadine,43166.0,Unnamed Road Dar El Beïda,"136 logements, bloc D، Route de Ouled Fayet Ch...","36.745614799999998,2.9428641",2019-09-29 23:54:36,23:54:00,33.0,31895.0,,finished,133300.0


In [75]:
data.loc[data["travel_type"] == "live"]

Unnamed: 0,travel_id,travel_type,car_type,driver_id,address_of_departure,arrival_address,lat_and_long_of_arrival_address,date_of_travel,time_of_travel,estimated_time,distance,options,state,price
1,28026204,live,Mini Citadine,43166.0,Unnamed Road Dar El Beïda,"136 logements, bloc D، Route de Ouled Fayet Ch...","36.745614799999998,2.9428641",2019-09-29 23:54:36,23:54:00,33.0,31895.0,,finished,133300.0
2,28026119,live,Mini Citadine,30759.0,"Office Riadh El Feth ,bois des arcades ,El Mad...","79 Rue Fabri MARCELLO, Bir Mourad Raïs, 阿尔及利亚","36.736257999999999,3.0616660000000002",2019-09-29 23:54:27,23:54:00,7.0,3056.0,,finished,32600.0
3,28025871,live,Mini Citadine,32536.0,N24 Bordj El Kiffan,"Dar El Beïda, Algeria","36.727856899999999,3.2181302000000001",2019-09-29 23:47:25,23:47:00,7.0,6092.0,,finished,38300.0
4,28026102,live,Mini Citadine,40592.0,"Lily Rose, Hydra, Algeria","214 résidence، Bois des Cars 2, Deli Ibrahim, ...","36.753667999999998,2.9762750000000002",2019-09-29 23:46:35,23:46:00,11.0,6928.0,,finished,47800.0
5,28025940,live,Mini Citadine,29567.0,"18 lot bouchebouk, Dely Ibrahim, Algérie","2 piliers, Bouzareah, Algiers, Algérie","36.785158799999998,3.0112462",2019-09-29 23:44:27,23:44:00,13.0,7061.0,,finished,45700.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65330,26076425,live,Mini Citadine,42286.0,"Rue Raoul PAYEN, Hydra, Algérie","Bordj El Bahri, Algérie","36.785442800000013,3.2643347999999999",2019-08-26 18:37:02,18:37:00,43.0,32420.0,,finished,119400.0
65331,26076320,live,Mini Citadine,39729.0,"Egeco, Algiers, Algérie",Chemin wilaya Nord 42 Kouba,"36.718478645545801,3.09075839817524",2019-08-26 18:36:57,18:36:00,6.0,2444.0,,operational_cancelled,0.0
65332,26076305,live,Mini Citadine,41170.0,Rue Ouahrani Abdelkader Oued Romane El Achour,Route AIN ALLAH Ain Allah Dely Ibrahim,"36.753617400324998,2.9945344850420952",2019-08-26 18:36:56,18:36:00,8.0,2582.0,,finished,27100.0
65333,26076433,live,Mini Citadine,34783.0,"26 Boulevard Colonel Bougara, Alger Ctre 16000...","Bab Ezzouar, Algérie","36.712336899999997,3.1968659000000001",2019-08-26 18:36:37,18:36:00,21.0,18408.0,,finished,69600.0
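The expression inside `data.loc[...]` is a boolean Series, so several conditions can be combined with `&` and `|`, exactly like NumPy masks. `loc` also accepts a second argument to keep only some columns. The sketch below uses a hypothetical toy table `toy` with made-up rows rather than the real dataset.

```python
import pandas as pd

# Hypothetical toy table, used only for this illustration
toy = pd.DataFrame({
    "travel_type": ["advance", "live", "live", "live"],
    "state": ["cancelled", "finished", "finished", "operational_cancelled"],
    "price": [0.0, 133300.0, 32600.0, 0.0],
})

# Each condition needs its own parentheses before combining with &
mask = (toy["travel_type"] == "live") & (toy["state"] == "finished")

finished_live = toy.loc[mask]              # all columns of the matching rows
finished_prices = toy.loc[mask, "price"]   # only the price column

print(finished_live)
print(finished_prices.tolist())   # [133300.0, 32600.0]
```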


In [67]:
# Select one column (the result is a pandas Series) 
data["travel_type"]

0        advance
1           live
2           live
3           live
4           live
          ...   
65330       live
65331       live
65332       live
65333       live
65334       live
Name: travel_type, Length: 65335, dtype: object

In [68]:
# What are the different possible values of travel_type ?
data["travel_type"].unique()

array(['advance', 'live', 'Personal'], dtype=object)

In [70]:
# How many travel rows do we have for each type ?
data.groupby("travel_type").size()

travel_type
Personal      214
advance     11484
live        53637
dtype: int64

In [77]:
# Same count with value_counts(), sorted from the most frequent type to the least frequent
data['travel_type'].value_counts()

live        53637
advance     11484
Personal      214
Name: travel_type, dtype: int64
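Both approaches return the same counts; they differ only in ordering (`groupby(...).size()` sorts by the group label, `value_counts()` by decreasing count). As a sketch on a hypothetical toy table `toy`, `value_counts` can also return proportions with `normalize=True`.

```python
import pandas as pd

# Hypothetical toy table, used only for this illustration
toy = pd.DataFrame({
    "travel_type": ["live", "live", "live", "advance", "advance", "Personal"],
})

by_group = toy.groupby("travel_type").size()      # sorted by label
by_count = toy["travel_type"].value_counts()      # sorted by decreasing count
shares = toy["travel_type"].value_counts(normalize=True)

print(by_group.to_dict() == by_count.to_dict())   # True : same counts
print(by_count.index[0])                          # live (the most frequent type)
print(shares["live"])                             # 0.5
```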

In [78]:
# Question : What are the different possible values of car_type ?

In [None]:
# Question : How many travel rows do we have for each type and each car_type ? 

##  Conclusion & Keypoints

* Anaconda is a package and environment manager. 
* Working with environments keeps your projects organized and avoids version conflicts. 
* Jupyter Notebook allows us to bring together code, documentation and visualizations in one document. 
* Behind every machine learning algorithm is a set of matrix operations. 
* NumPy is a very important package that provides a fast and efficient array structure (ndarray).
* NumPy arrays support slicing, broadcasting and aggregate statistical operations.
* Pandas helps us query our data in an SQL-like manner. 
* With pandas, we can explore our data row by row and column by column. 

In the next session, we will continue exploring the VTC dataset. We will understand why statistical operations are important in machine learning and discover another package, *matplotlib*, that will help us visualize different aspects of the data. 