# Introduction

This notebook gives some reminders on the [Pandas DataFrames and Series structures](https://pandas.pydata.org/docs/getting_started/dsintro.html).

## Series


`Series` structures can be created from the following data types:
- scalar values (or a list of scalar values)
- Python native dictionaries
- NumPy arrays (called ndarrays), which must be one-dimensional.

A series is a vector of values taken by a variable. Usually it would represent the values taken by a variable for different observations (or individuals).

In [None]:
import pandas as pd

### from scalar values

In [None]:
# Program to Create series with scalar values  
data_points =[1, 3, 4, 5, 6, 2, 9]  # Numeric data 
  
# Creating series with default index values 
s = pd.Series(data_points) 


In [None]:
# predefined index values 
index =['a', 'b', 'c', 'd', 'e', 'f', 'g']  
  
# Creating series with predefined index values 
si = pd.Series(data_points, index=index) 

In [None]:
si

a    1
b    3
c    4
d    5
e    6
f    2
g    9
dtype: int64

In [None]:
si['f']  # direct indexing

2

### from a dictionary

A dictionary is a key-value mapping. Whereas the values of a list are indexed by their position, the values of a dictionary are indexed by a key, which can be any hashable object (for instance a string or a number).

In [None]:
# Program to Create Dictionary series 
dictionary ={'a':1, 'b':2, 'c':3, 'd':4, 'e':5}  
  
# Creating series of Dictionary type 
sd = pd.Series(dictionary)
sd

a    1
b    2
c    3
d    4
e    5
dtype: int64

### from an ndarray

In [None]:
import pandas as pd
# Program to create a series from nested data 
nddata = [[2, 3, 4], [5, 6, 7]]  # A nested Python list (not yet a NumPy ndarray) 
  
# Each inner list becomes one element of the series 
snd = pd.Series(nddata) 

In [None]:
nddata, nddata[0][1]

([[2, 3, 4], [5, 6, 7]], 3)

In [None]:
snd[1][1]

6
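
The `snd` series above stores two Python lists as its elements. As a minimal sketch of the two other cases listed at the start of this section, the cell below builds a series from a single scalar and from a one-dimensional NumPy array. The names `s_scalar`, `arr` and `s_arr` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A scalar is repeated for every label of the index passed explicitly.
s_scalar = pd.Series(5.0, index=["a", "b", "c"])

# A one-dimensional NumPy array gives one series element per array element.
arr = np.array([2, 3, 4])
s_arr = pd.Series(arr, index=["x", "y", "z"])

print(s_scalar)
print(s_arr["y"])  # label-based access to a single element
```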

## DataFrames

A DataFrame is a 2-dimensional data structure: its columns hold the variables and its rows hold the observations.

It can be built from the same kind of data as `Series`:
- one or more scalar vectors
- one or more dictionaries
- a 2D NumPy ndarray

In [None]:
# Program to Create Data Frame with two dictionaries 
dict1 ={'a':1, 'b':2, 'c':3, 'd':4}        # Define Dictionary 1 
dict2 ={'a':5, 'b':6, 'c':7, 'd':8, 'e':9} # Define Dictionary 2 
data = {'first':dict1, 'second':dict2}  # Define Data with dict1 and dict2 
df = pd.DataFrame(data)  # Create DataFrame 

In [None]:
df  # note that the missing value is filled with a NaN by default (Not a Number)

Unnamed: 0,first,second
a,1.0,5
b,2.0,6
c,3.0,7
d,4.0,8
e,,9


### from Series

A DataFrame can also be created from a set of series, for instance as follows:

In [None]:
# Program to create Dataframe of three series  
import pandas as pd 
  
s1 = pd.Series([1, 3, 4, 5, 6, 2, 9])           # Define series 1 
s2 = pd.Series([1.1, 3.5, 4.7, 5.8, 2.9, 9.3]) # Define series 2 
s3 = pd.Series(['a', 'b', 'c', 'd', 'e'])     # Define series 3 
  
  
Data ={'first':s1, 'second':s2, 'third':s3} # Define Data 
dfseries = pd.DataFrame(Data)              # Create DataFrame 

### from 2D-numpy ndarray

In [None]:
# Program to create DataFrame from 2D data 
import pandas as pd # Import Library 
d1 = [[2, 3, 4], [5, 6, 7]] # Define nested list 1 
d2 = [[2, 4, 8], [1, 3, 9]] # Define nested list 2 
Data = {'first': d1, 'second': d2} # Each inner list becomes a single cell of its column 
df2d = pd.DataFrame(Data)    # Create DataFrame 

In [None]:
df2d

Unnamed: 0,first,second
0,"[2, 3, 4]","[2, 4, 8]"
1,"[5, 6, 7]","[1, 3, 9]"
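
In `df2d`, every cell holds a whole Python list. If the goal is instead one array element per cell, a 2D NumPy array can be passed directly to `pd.DataFrame`. This is a small sketch under that assumption; the column and row labels (`first`, `second`, `third`, `r0`, `r1`) and the names `arr2d` and `df_arr` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A genuine 2D ndarray: 2 rows, 3 columns.
arr2d = np.array([[2, 3, 4], [5, 6, 7]])

# Each array column becomes a DataFrame column, each array row a DataFrame row.
df_arr = pd.DataFrame(arr2d, columns=["first", "second", "third"], index=["r0", "r1"])

print(df_arr)
print(df_arr.shape)
```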


## Tidy data

It is useful to organize a DataFrame as [_tidy data_](https://vita.had.co.nz/papers/tidy-data.pdf):

"A dataset is a collection of **values**, usually either numbers (if quantitative) or strings (if
qualitative). Values are organised in two ways. Every value belongs to a **variable** and an
**observation**. A variable contains all values that measure the same underlying attribute (like
height, temperature, duration) across units. An observation contains all values measured on
the same unit (like a person, or a day, or a race) across attributes."

In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
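
To make the three rules concrete, here is a hedged sketch that reshapes a small "wide" table into tidy form with `melt`. The table, its values and the names `wide` and `tidy` are invented for the example.

```python
import pandas as pd

# Hypothetical wide table: one row per person, one column per year.
# The year is a variable, but here it is hidden in the column names.
wide = pd.DataFrame({
    "person": ["Ann", "Bob"],
    "2019": [170, 180],
    "2020": [171, 180],
})

# melt turns the year columns into (year, height) pairs,
# so that each row is one observation and each column one variable.
tidy = wide.melt(id_vars="person", var_name="year", value_name="height")

print(tidy)
```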


# Exercise 

We are going to play with an automobile dataset, in which each column gives a different feature of a car, such as body shape, motor type, price, etc.

You can download the data at [this url](https://github.com/annemariet/tutorials/blob/master/data/Automobile_data.csv). 




Load the dataframe and print the first 10 and last 10 lines:

- use the GitHub "raw" button to get the link to the raw content
- pandas `read_csv` can read from URLs
- use `head` and `tail` methods.

In [None]:
# load df
import pandas as pd
url = "https://raw.githubusercontent.com/annemariet/tutorials/master/data/Automobile_data.csv"
c = pd.read_csv(url)
print(c)

    index      company   body-style  ...  horsepower  average-mileage    price
0       0  alfa-romero  convertible  ...         111               21  13495.0
1       1  alfa-romero  convertible  ...         111               21  16500.0
2       2  alfa-romero    hatchback  ...         154               19  16500.0
3       3         audi        sedan  ...         102               24  13950.0
4       4         audi        sedan  ...         115               18  17450.0
..    ...          ...          ...  ...         ...              ...      ...
56     81   volkswagen        sedan  ...          85               27   7975.0
57     82   volkswagen        sedan  ...          52               37   7995.0
58     86   volkswagen        sedan  ...         100               26   9995.0
59     87        volvo        sedan  ...         114               23  12940.0
60     88        volvo        wagon  ...         114               23  13415.0

[61 rows x 10 columns]


In [None]:
# head
c.head()  # shows 5 rows by default; c.head(10) shows the first 10

Unnamed: 0,index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
0,0,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,13495.0
1,1,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,16500.0
2,2,alfa-romero,hatchback,94.5,171.2,ohcv,six,154,19,16500.0
3,3,audi,sedan,99.8,176.6,ohc,four,102,24,13950.0
4,4,audi,sedan,99.4,176.6,ohc,five,115,18,17450.0


In [None]:
# tail
c.tail()  # shows 5 rows by default; c.tail(10) shows the last 10

Unnamed: 0,index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
56,81,volkswagen,sedan,97.3,171.7,ohc,four,85,27,7975.0
57,82,volkswagen,sedan,97.3,171.7,ohc,four,52,37,7995.0
58,86,volkswagen,sedan,97.3,171.7,ohc,four,100,26,9995.0
59,87,volvo,sedan,104.3,188.8,ohc,four,114,23,12940.0
60,88,volvo,wagon,104.3,188.8,ohc,four,114,23,13415.0


You can access the content of a column by indexing the DataFrame with the column name, which returns a `pd.Series`. You can view the list of columns with `df.columns`.

In [None]:
d1=c['company']
d1

0     alfa-romero
1     alfa-romero
2     alfa-romero
3            audi
4            audi
         ...     
56     volkswagen
57     volkswagen
58     volkswagen
59          volvo
60          volvo
Name: company, Length: 61, dtype: object

In [None]:
c.columns

Index(['index', 'company', 'body-style', 'wheel-base', 'length', 'engine-type',
       'num-of-cylinders', 'horsepower', 'average-mileage', 'price'],
      dtype='object')

What is the company with the most expensive car?
- using boolean filtering, you can select the rows for which a predicate is true, e.g. `df[df["<column>"] == <value>]`.
- using `df.loc[<index>]` you can select rows at the given indexes.
- pandas offers both `max` and `idxmax` methods.

In [None]:
c[c['price']==c['price'].max()]


Unnamed: 0,index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
35,47,mercedes-benz,hardtop,112.0,199.2,ohcv,eight,184,14,45400.0
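
The hints also mention `idxmax` and `loc`. As an alternative sketch on a small made-up DataFrame (`cars`, with invented rows), `idxmax` returns the index label of the highest price and `loc` retrieves the matching row.

```python
import pandas as pd

# Small illustrative table, not the real dataset.
cars = pd.DataFrame({
    "company": ["audi", "bmw", "volvo"],
    "price": [13950.0, 41315.0, 12940.0],
})

best_label = cars["price"].idxmax()  # index label of the highest price
best_row = cars.loc[best_label]      # the whole row, returned as a Series

print(best_row["company"])
```

Unlike the boolean filter, `idxmax` returns a single label, so ties are reduced to the first occurrence.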


Print the details of all the Toyota cars.




In [None]:
# answer
c[c["company"]=="toyota"]

Unnamed: 0,index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
48,66,toyota,hatchback,95.7,158.7,ohc,four,62,35,5348.0
49,67,toyota,hatchback,95.7,158.7,ohc,four,62,31,6338.0
50,68,toyota,hatchback,95.7,158.7,ohc,four,62,31,6488.0
51,69,toyota,wagon,95.7,169.7,ohc,four,62,31,6918.0
52,70,toyota,wagon,95.7,169.7,ohc,four,62,27,7898.0
53,71,toyota,wagon,95.7,169.7,ohc,four,62,27,8778.0
54,79,toyota,wagon,104.5,187.8,dohc,six,156,19,15750.0


You can use `<Series>.value_counts()` to count the number of cars per company.

In [None]:
c["company"].value_counts()


toyota           7
bmw              6
mazda            5
nissan           5
mitsubishi       4
volkswagen       4
audi             4
mercedes-benz    4
jaguar           3
porsche          3
isuzu            3
chevrolet        3
honda            3
alfa-romero      3
dodge            2
volvo            2
Name: company, dtype: int64

Find the most expensive car for each company. For this you can use the `groupby` method.

In [None]:
c.groupby("company")[['price', 'index']].max()


company,price,index
alfa-romero,16500.0,2
audi,18920.0,6
bmw,41315.0,15
chevrolet,6575.0,18
dodge,6377.0,20
honda,12945.0,29
isuzu,6785.0,32
jaguar,36000.0,35
mazda,18344.0,43
mercedes-benz,45400.0,47
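
Note that `max()` on several columns takes the maximum of each column separately, so the `index` value shown for a company is not guaranteed to belong to its most expensive car. One possible way to get the full row of the priciest car per company is `idxmax` on the grouped prices followed by `loc`. The sketch below uses a small invented table (`cars`) rather than the real dataset.

```python
import pandas as pd

# Illustrative data: the most expensive bmw is NOT the one with the largest index.
cars = pd.DataFrame({
    "index": [0, 1, 2, 3],
    "company": ["audi", "audi", "bmw", "bmw"],
    "price": [13950.0, 17450.0, 41315.0, 16430.0],
})

# For each company, the row label of its highest price...
rows = cars.groupby("company")["price"].idxmax()

# ...then the complete rows at those labels.
most_expensive = cars.loc[rows]

print(most_expensive)
```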


Groupbys allow for a variety of aggregation functions. Can you compute the average mileage by company?

In [None]:
c.groupby("company")['average-mileage'].mean()


company
alfa-romero      20.333333
audi             20.000000
bmw              19.000000
chevrolet        41.000000
dodge            31.000000
honda            26.333333
isuzu            33.333333
jaguar           14.333333
mazda            28.000000
mercedes-benz    18.000000
mitsubishi       29.500000
nissan           31.400000
porsche          17.000000
toyota           28.714286
volkswagen       31.750000
volvo            23.000000
Name: average-mileage, dtype: float64

Sort all cars by decreasing price, using the `sort_values` method.

In [None]:
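# sort_values sorts in ascending order by default, as in the output below;
# pass ascending=False to sort by decreasing price
c.sort_values('price')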

Unnamed: 0,index,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
13,16,chevrolet,hatchback,88.4,141.1,l,three,48,47,5151.0
27,36,mazda,hatchback,93.1,159.1,ohc,four,68,30,5195.0
48,66,toyota,hatchback,95.7,158.7,ohc,four,62,35,5348.0
36,49,mitsubishi,hatchback,93.7,157.3,ohc,four,68,37,5389.0
28,37,mazda,hatchback,93.1,159.1,ohc,four,68,31,6095.0
...,...,...,...,...,...,...,...,...,...,...
11,14,bmw,sedan,103.5,193.8,ohc,six,182,16,41315.0
35,47,mercedes-benz,hardtop,112.0,199.2,ohcv,eight,184,14,45400.0
22,31,isuzu,sedan,94.5,155.9,ohc,four,70,38,
23,32,isuzu,sedan,94.5,155.9,ohc,four,70,38,
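
For the decreasing order asked in the exercise, `sort_values` accepts `ascending=False`. A minimal sketch on an invented three-row table (`cars`), including one missing price:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing price.
cars = pd.DataFrame({
    "company": ["chevrolet", "isuzu", "bmw"],
    "price": [5151.0, np.nan, 41315.0],
})

# Highest price first; rows with a missing price are placed last by default.
by_price_desc = cars.sort_values("price", ascending=False)

print(by_price_desc)
```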


Pandas gives you access to merge and concatenate functions. Create two DataFrames from the following dictionaries and merge them to get a single DataFrame with 3 columns (Company, Price, horsepower) and 4 rows.

In [None]:
Car_Price = {'Company': ['Toyota', 'Honda', 'BMV', 'Audi'], 'Price': [23845, 17995, 135925 , 71400]}
car_Horsepower = {'Company': ['Toyota', 'Honda', 'BMV', 'Audi'], 'horsepower': [141, 80, 182 , 160]}

df1 = pd.DataFrame(Car_Price)
df2 = pd.DataFrame(car_Horsepower)

# concat would stack the two frames (8 rows, with NaN in the missing columns);
# merge joins them on the shared 'Company' column instead
result = pd.merge(df1, df2, on='Company')

In [None]:
result.head()

Unnamed: 0,Company,Price,horsepower
0,Toyota,23845,141
1,Honda,17995,80
2,BMV,135925,182
3,Audi,71400,160


Write a program to change the order of a pandas Series. Create a series indexed with 'A', 'B', 'C'... like this:
- A 1
- B 2
- C 3
- D 4
- E 5

and reorder using a new list such as 'B', 'D', 'E'..., using `reindex`.

In [None]:
# code
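
One possible answer, given as a sketch: `reindex` returns a new series whose rows follow the order of the labels passed to it. The names `letters` and `reordered` and the chosen order are illustrative.

```python
import pandas as pd

# Series indexed with the labels 'A' to 'E'.
letters = pd.Series([1, 2, 3, 4, 5], index=["A", "B", "C", "D", "E"])

# reindex returns a new series ordered by the given labels.
reordered = letters.reindex(["B", "D", "E", "A", "C"])

print(reordered)
```

A label that is absent from the original index would appear in the result with a `NaN` value.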

# NumPy

The NumPy library (http://www.numpy.org/) is the go-to library for numerical analysis in Python.

In [None]:
import numpy as np

In [None]:
np.pi

3.141592653589793

## Arrays with numpy.array()

### Creation
You can create an array from a list (1-d vector), or a list of lists of the same lengths (2-d matrix).
You can also create empty arrays, arrays of zeros and ones, or random arrays of any given size and value type (int, float...). 

In [None]:
a = np.array([[1, 2, 3], [4, 5, 6]])

In [None]:
a

array([[1, 2, 3],
       [4, 5, 6]])

In [None]:
type(a), a.dtype


(numpy.ndarray, dtype('int64'))
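
The other creation routines mentioned above can be sketched as follows. The shapes, the seed and the variable names are arbitrary choices made for the example.

```python
import numpy as np

zeros = np.zeros((2, 3))            # 2x3 array of 0.0 (float64 by default)
ones = np.ones((2, 3), dtype=int)   # 2x3 array of integer ones
empty = np.empty((2, 3))            # allocated but uninitialised: contents are arbitrary

# Random arrays through a seeded generator, so that runs are reproducible.
rng = np.random.default_rng(seed=0)
rand_floats = rng.random((2, 3))              # uniform floats in [0, 1)
rand_ints = rng.integers(0, 10, size=(2, 3))  # integers in [0, 10)

print(zeros.dtype, ones.dtype)
print(rand_floats.shape, rand_ints.shape)
```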


### Accessing elements

In [None]:
a[0,1]

2

In [None]:
a[1,2]

### numpy.arange()

`numpy.arange` gives you a range from a to b (excluded), increasing with the given step (defaulting to 1).

In [None]:
m = np.arange(3, 15, 2)
m

array([ 3,  5,  7,  9, 11, 13])

Note the difference between `numpy.arange()` and native Python `range()`:

- `numpy.arange()` returns a numpy.ndarray.
- `range()` returns an object of type `range`, a lazy sequence that produces its integers on demand.

In [None]:
type(m)

numpy.ndarray

In [None]:
n = range(3, 15, 2)
type(n)

range

`numpy.arange()` accepts non-integer inputs.

In [None]:
np.arange(0, 11*np.pi, np.pi)

array([ 0.        ,  3.14159265,  6.28318531,  9.42477796, 12.56637061,
       15.70796327, 18.84955592, 21.99114858, 25.13274123, 28.27433388,
       31.41592654])

### numpy.linspace()
`numpy.linspace()` has a different API: it takes a range [a, b] (b included) and a number of values rather than a step.

In [None]:
np.linspace(3, 9, 10)

array([3.        , 3.66666667, 4.33333333, 5.        , 5.66666667,
       6.33333333, 7.        , 7.66666667, 8.33333333, 9.        ])
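
To see which step `linspace` derived from the number of values, the optional `retstep=True` argument returns it alongside the array, and `endpoint=False` excludes b. A short sketch (variable names are illustrative):

```python
import numpy as np

# retstep=True also returns the spacing between consecutive values.
values, step = np.linspace(3, 9, 10, retstep=True)
print(step)  # (9 - 3) / (10 - 1)

# endpoint=False leaves b out, as arange does.
no_end = np.linspace(3, 9, 10, endpoint=False)
print(no_end[-1])
```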

## Applying mathematical functions

`numpy` gives you a number of mathematical functions that can be applied to numpy arrays, i.e. to each element individually: `sin`, `cos`, `log`, `exp`...


In [None]:
x = np.linspace(-np.pi/2, np.pi/2, 3)
y = np.sin(x)
y

array([-1.,  0.,  1.])

# Exercise

Create a 4x2 integer array (of type unsigned int16) and print the following attributes:
- the shape `shape`,
- the number of dimensions `ndim`,
- the size in bytes of each element `itemsize`.

Compare also with `nbytes` and `size`.

In [None]:
import numpy as np
my_array = np.zeros((4, 2), dtype=np.uint16)  # dtype sets the element type to unsigned 16-bit integers

In [None]:
print(my_array.ndim)

2
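
The remaining attributes can be inspected in the same way. This sketch rebuilds a 4x2 `uint16` array under the illustrative name `arr`, so that it runs on its own.

```python
import numpy as np

arr = np.zeros((4, 2), dtype=np.uint16)

print(arr.shape)     # (4, 2): 4 rows, 2 columns
print(arr.ndim)      # 2 dimensions
print(arr.itemsize)  # 2 bytes per uint16 element
print(arr.size)      # 8 elements in total
print(arr.nbytes)    # 16 bytes, i.e. size * itemsize
```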


- Create an array of size 5x2, with values ranging from 100 to 200, such that the difference between two consecutive elements is 10. You can use `arange` and `reshape`.

In [None]:
my_array2=np.arange(100,200,10).reshape((5,2))



In [None]:
print(my_array2)

[[100 110]
 [120 130]
 [140 150]
 [160 170]
 [180 190]]


Given the following array, can you print the third column only?


In [None]:
import numpy
sampleArray = numpy.array([[11 ,22, 33], [44, 55, 66], [77, 88, 99]])

In [None]:
print(sampleArray.shape)

(3, 3)


In [None]:
print(sampleArray[:,2])

[33 66 99]


In [None]:
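print("Printing Input Array")
print(sampleArray)
print("\n Printing array of items in the third column from all rows")
print(sampleArray[:, 2])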

Printing Input Array
[[11 22 33]
 [44 55 66]
 [77 88 99]]

 Printing array of items in the third column from all rows
[33 66 99]


Given the following array, can you return only odd rows and even columns? (considering the mathematical numbering with row 1 at index 0). 


In [None]:
import numpy
sampleArray = numpy.array([[3 ,6, 9, 12], [15 ,18, 21, 24], 
[27 ,30, 33, 36], [39 ,42, 45, 48], [51 ,54, 57, 60]])


In [None]:
# code
sampleArray[::2]  # odd rows (indexes 0, 2, 4)

array([[ 3,  6,  9, 12],
       [27, 30, 33, 36],
       [51, 54, 57, 60]])

In [None]:
sampleArray[:, 1::2]  # even columns (indexes 1, 3)

array([[ 6, 12],
       [18, 24],
       [30, 36],
       [42, 48],
       [54, 60]])
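
The two slices can also be combined in a single indexing expression to get the odd rows and even columns at once. The sketch below redefines the array under the illustrative name `sample` so that it is self-contained.

```python
import numpy as np

sample = np.array([[3, 6, 9, 12], [15, 18, 21, 24],
                   [27, 30, 33, 36], [39, 42, 45, 48], [51, 54, 57, 60]])

# Rows at indexes 0, 2, 4 (1st, 3rd, 5th) and columns at indexes 1, 3 (2nd, 4th).
odd_rows_even_cols = sample[::2, 1::2]

print(odd_rows_even_cols)
```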

Let A, B be two arrays of the same size; compute C such that $c_i = \sqrt{a_i + b_i}$.

In [None]:
import numpy
arrayOne = numpy.array([[5, 6, 9], [21 ,18, 27]])
arrayTwo = numpy.array([[15 ,33, 24], [4 ,7, 1]])

In [None]:
# + and np.sqrt both operate element-wise on arrays,
# so no explicit loop over the indices is needed
result = np.sqrt(arrayOne + arrayTwo)
print(result)

[[4.47213595 6.244998   5.74456265]
 [5.         5.         5.29150262]]


Create a new integer array of size 8x3, with values ranging from 10 to 34 (excluded) with step size=1. Split the array into 4 subarrays (using `split`).


In [None]:
my_array2=np.arange(10,34,1).reshape((8,3))
np.array_split(my_array2,4)

[array([[10, 11, 12],
        [13, 14, 15]]), array([[16, 17, 18],
        [19, 20, 21]]), array([[22, 23, 24],
        [25, 26, 27]]), array([[28, 29, 30],
        [31, 32, 33]])]

Sort the array:
- along the second row
- along the second column

In [None]:
import numpy
sampleArray = numpy.array([[34,43,73],[82,22,12],[53,94,66]])

In [None]:
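# code
# np.sort sorts the values independently along the given axis
np.sort(sampleArray, axis=0)  # sorts each column (not displayed: only the last expression is shown)
np.sort(sampleArray, axis=1)  # sorts each row


array([[34, 43, 73],
       [12, 22, 82],
       [53, 66, 94]])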
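
`np.sort` reorders the values of every row or column independently. If the exercise is read as reordering whole columns according to the values of the second row, and whole rows according to the second column, `argsort` is one way to do it. This is a sketch under that interpretation; the names `sample`, `by_second_row` and `by_second_col` are illustrative.

```python
import numpy as np

sample = np.array([[34, 43, 73], [82, 22, 12], [53, 94, 66]])

# Column order that sorts the second row (index 1), applied to all rows.
by_second_row = sample[:, sample[1, :].argsort()]

# Row order that sorts the second column (index 1), applied to all columns.
by_second_col = sample[sample[:, 1].argsort()]

print(by_second_row)
print(by_second_col)
```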

Given the following array, print the max along axis 0 and the min along axis 1.

In [None]:
import numpy
sampleArray = numpy.array([[34,43,73],[82,22,12],[53,94,66]])

In [None]:
# code
print(np.max(sampleArray,axis=0))
print(np.min(sampleArray,axis=1))

[82 94 73]
[34 12 53]


Given the following array, remove the second column and replace it with the new column values using `delete` and `insert`. Print the intermediate results.

In [None]:
import numpy
sampleArray = numpy.array([[34,43,73],[82,22,12],[53,94,66]]) 
newColumn = numpy.array([[10,10,10]]) 

In [None]:
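# axis=1 removes the second column; without axis, np.delete works on the flattened array
reduced = np.delete(sampleArray, 1, axis=1)
reduced

array([[34, 73],
       [82, 12],
       [53, 66]])

In [None]:
# insert the new values as column 1 (`reduced` comes from the previous cell)
np.insert(reduced, 1, newColumn, axis=1)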