# Pandas

### Miscellaneous info
* tools for reading and writing data
* data alignment and integrated handling of missing data
* the ability to perform arithmetic operations on the data
* easy reshaping and pivoting of data sets
* user-friendly operations for merging and joining data
* the ability to handle time series

### Functions covered
* The `iloc` indexer slices both rows and columns by integer position. The general syntax is as follows:
```python
df.iloc[rows, columns]
```
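
As a small illustration of the syntax above, the sketch below slices a made-up 3x3 DataFrame by position. The name `demo` and its values are assumptions chosen for this example only. As with ordinary Python slices, the end position is excluded.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the values 0..8 laid out in 3 rows and 3 columns
demo = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

# Rows at positions 0 and 1, columns at positions 1 and 2
subset = demo.iloc[0:2, 1:3]

# A single integer selects one row, returned as a Series
last_row = demo.iloc[-1]

# A list of positions selects non-contiguous columns
outer_cols = demo.iloc[:, [0, 2]]
```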

#import
import pandas as pd
import numpy as np

## 03. Series and DataFrames

## SERIES 
my_Series = pd.Series([1,'cat',10.2,'dog'])
my_Series
#0       1
#1     cat
#2    10.2
#3     dog

### Setting a custom index

ages = pd.Series([20,53,68], index=['John', 'Allen', 'Mary'])
ages
#John     20
#Allen    53
#Mary     68
#dtype: int64
ages['John']
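
With a custom index, a Series can be queried by label or by position. The following sketch reuses the `ages` Series from these notes; the variable names `by_label`, `by_position` and `over_50` are introduced here only for illustration.

```python
import pandas as pd

ages = pd.Series([20, 53, 68], index=['John', 'Allen', 'Mary'])

# Label-based lookup
by_label = ages.loc['Allen']

# Position-based lookup (first element)
by_position = ages.iloc[0]

# Boolean filtering keeps the labels of the selected elements
over_50 = ages[ages > 50]
```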

## DATAFRAMES

df = pd.DataFrame( {'user' : [1,2,3],
            'age' : [24,54,17],
            'sex' : ['F','F','M'],
            'occupation' : ['technician','musician','student']})
#set_index returns a new DataFrame, df itself is not modified
df.set_index('user')
#displaying the number of rows
df.shape[0]
#displaying the number of columns
df.shape[1]
#displaying the labels of all the columns
df.columns
#displaying the data types of each column
df.dtypes

df.describe()

df['occupation']

#0    technician
#1      musician
#2       student
#Name: occupation, dtype: object

## 05. Manipulating the data

df = pd.DataFrame(np.arange(9).reshape(3,3), columns=['a','b', 'c'])
#df.drop(0, axis=0)
df.add(df.loc[0:1,:], fill_value=0)
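
To make the role of `fill_value` in the call above more concrete, here is a minimal sketch comparing plain addition with `add(..., fill_value=0)` on the same 3x3 DataFrame. The names `partial`, `plain` and `filled` are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

# .loc slices by label and includes both endpoints: rows 0 and 1
partial = df.loc[0:1, :]

# Plain addition: row 2 has no partner in `partial`, so it becomes NaN
plain = df + partial

# add() with fill_value=0: the missing partner is treated as 0,
# so row 2 keeps its original values
filled = df.add(partial, fill_value=0)
```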

df1= pd.DataFrame([['Mark', 50], ['Kate', 46]],
                 columns=['name', 'age'])
df2 = pd.DataFrame([['Jon', 3], ['David', 4]],
                columns=['name', 'age'])
pd.concat([df1,df2])

df3 = pd.DataFrame(['writer', 'journalist'], columns=['occupation'])
pd.concat([df1,df3], axis = 1)

|   | name | age | occupation |
|---|------|-----|------------|
| 0 | Mark | 50 | writer |
| 1 | Kate | 46 | journalist |
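
One detail worth noting about row-wise concatenation is that `pd.concat` keeps the original index labels, so duplicates can appear. The sketch below, which reuses `df1` and `df2` from above, shows the effect of `ignore_index=True`; the names `stacked` and `renumbered` are assumptions for this example.

```python
import pandas as pd

df1 = pd.DataFrame([['Mark', 50], ['Kate', 46]], columns=['name', 'age'])
df2 = pd.DataFrame([['Jon', 3], ['David', 4]], columns=['name', 'age'])

# Default behaviour: each frame keeps its own labels (0, 1, 0, 1)
stacked = pd.concat([df1, df2])

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([df1, df2], ignore_index=True)
```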


## 06. Indexing, selecting and filtering

# We will work on a subset of the columns
columns = [
    'Mountain', 'Height (m)', 'Range', 'Coordinates', 'Parent mountain',
    'First ascent', 'Ascents bef. 2004', 'Failed attempts bef. 2004'
]
# Load the DataFrame, we will work on the first 10 rows (ten highest mountains)
df = pd.read_csv('data/Mountains.csv', nrows=10, usecols=columns)
df.set_index('Mountain', inplace=True)
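# Attribute access works when the column name is a valid Python identifier
df.Range
# 'Height (m)' is not a valid identifier: use getattr or, more commonly, df['Height (m)']
getattr(df, 'Height (m)')
# Boolean filtering: mountains whose parent is Mount Everest
df[df['Parent mountain'] == 'Mount Everest']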

# Mountains with Mount Everest as parent with a first ascent after 1955
df[(df['Parent mountain'] == 'Mount Everest') & (df['First ascent'] > 1955)]

# One boolean per remaining column: keeps 'Height (m)', 'First ascent' and 'Ascents bef. 2004'
col_criteria = [True, False, False, False, True, True, False]
df.loc[df['Height (m)'] > 8000, col_criteria]

| Mountain | Height (m) | Range | Coordinates | Parent mountain | First ascent | Ascents bef. 2004 | Failed attempts bef. 2004 |
|----------|------------|-------|-------------|-----------------|--------------|-------------------|---------------------------|
| Mount Everest / Sagarmatha / Chomolungma | 8848 | Mahalangur Himalaya | 27°59′17″N 86°55′31″E |  | 1953 | >>145 | 121 |
| K2 / Qogir / Godwin Austen | 8611 | Baltoro Karakoram | 35°52′53″N 76°30′48″E | Mount Everest | 1954 | 45 | 44 |
| Kangchenjunga | 8586 | Kangchenjunga Himalaya | 27°42′12″N 88°08′51″E | Mount Everest | 1955 | 38 | 24 |
| Lhotse | 8516 | Mahalangur Himalaya | 27°57′42″N 86°55′59″E | Mount Everest | 1956 | 26 | 26 |
| Makalu | 8485 | Mahalangur Himalaya | 27°53′23″N 87°05′20″E | Mount Everest | 1955 | 45 | 52 |
| Cho Oyu | 8188 | Mahalangur Himalaya | 28°05′39″N 86°39′39″E | Mount Everest | 1954 | 79 | 28 |
| Dhaulagiri I | 8167 | Dhaulagiri Himalaya | 28°41′48″N 83°29′35″E | K2 | 1960 | 51 | 39 |
| Manaslu | 8163 | Manaslu Himalaya | 28°33′00″N 84°33′35″E | Cho Oyu | 1956 | 49 | 45 |
| Nanga Parbat | 8126 | Nanga Parbat Himalaya | 35°14′14″N 74°35′21″E | Dhaulagiri | 1953 | 52 | 67 |
| Annapurna I | 8091 | Annapurna Himalaya | 28°35′44″N 83°49′13″E | Cho Oyu | 1950 | 36 | 47 |
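
Since the CSV file is not included in these notes, the filtering pattern can be tried on a small hand-built DataFrame. The sketch below copies four rows from the table above (with the K2 name shortened for readability); the names `peaks`, `mask`, `late_everest_children` and `heights` are assumptions for this example.

```python
import pandas as pd

# Four rows copied from the table above, indexed by mountain name
peaks = pd.DataFrame(
    {
        'Height (m)': [8611, 8516, 8167, 8163],
        'Parent mountain': ['Mount Everest', 'Mount Everest', 'K2', 'Cho Oyu'],
        'First ascent': [1954, 1956, 1960, 1956],
    },
    index=pd.Index(['K2', 'Lhotse', 'Dhaulagiri I', 'Manaslu'], name='Mountain'),
)

# Each condition needs its own parentheses because & binds tighter than == and >
mask = (peaks['Parent mountain'] == 'Mount Everest') & (peaks['First ascent'] > 1955)
late_everest_children = peaks[mask]

# loc accepts a boolean row mask together with a list of column labels
heights = peaks.loc[peaks['Height (m)'] > 8200, ['Height (m)', 'First ascent']]
```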


## 07. Views vs copies

df = pd.DataFrame( {
    'user' : [1,2,3], 
    'age' : [24,54,17], 
    'sex' : ['F','F','M'], 
    'occupation' : ['technician','musician','student']
})
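# .copy() makes df2 independent of df, so modifying df2 leaves df untouched
df2 = df.loc[df.sex=='F'].copy()
# .loc slices by label and includes both endpoints: rows 0 and 1
df2.loc[0:1,'sex']='Female'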
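
The effect of the explicit `.copy()` can be checked with a short sketch: after modifying the copy, the original DataFrame should be unchanged. The reduced two-column DataFrame and the name `females` are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({'user': [1, 2, 3], 'sex': ['F', 'F', 'M']})

# Explicit copy of the filtered rows
females = df.loc[df['sex'] == 'F'].copy()

# Modify the copy only (labels 0 and 1, both endpoints included)
females.loc[0:1, 'sex'] = 'Female'

# The original still holds the short codes
original_values = df['sex'].tolist()
copied_values = females['sex'].tolist()
```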

## 08. Applying functions
* map()
* apply()
* applymap()

df = pd.DataFrame( {
    'user': [1, 2, 3],
    'age': [24, 54, 17],
    'sex': ['F', 'F', 'M'],
    'occupation': ['technician', 'musician', 'student']
})
# map: takes a dict; a value that is not a key of the dict becomes NaN
df['sex'] = df['sex'].map({'F': 'Female', 'M': 'Male'})
# replace: substitutes the matching value; values that do not match are left unchanged
df['sex'].replace('Female', '1')
# apply: applies a function along an axis (axis=0: once per column)
df2 = pd.DataFrame(
    data=np.arange(9).reshape(3,3), columns=['a','b', 'c'])
df2.apply(sum, axis=0)
# applymap: applies a function to every element and returns a new DataFrame
# (slow, since the function is called once per element; deprecated since pandas 2.1 in favor of DataFrame.map)
def my_func(x):
    if x > 5:
        size = 'Large'
    elif x > 3:
        size = 'Medium'
    else:
        size = 'Small'
    return size
df2.applymap(my_func)

|   | a | b | c |
|---|---|---|---|
| 0 | Small | Small | Small |
| 1 | Small | Medium | Medium |
| 2 | Large | Large | Large |
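
The difference between `map` and `replace` noted in the comments above shows up as soon as a value is missing from the dictionary. In this sketch the Series `sex` and its unknown value `'X'` are made up for illustration.

```python
import pandas as pd

# 'X' is deliberately absent from the mapping below
sex = pd.Series(['F', 'M', 'X'])
labels = {'F': 'Female', 'M': 'Male'}

# map: unknown values become NaN
mapped = sex.map(labels)

# replace: unknown values are left as they are
replaced = sex.replace(labels)
```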


## 09. Sorting

df = pd.DataFrame({'A':[3,6,1,12,3],'B':[0,0,7,5,6],'C':[10,4,5,8,2]})
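# sort the columns by label, in descending order (C, B, A)
df.sort_index(ascending=False, axis=1)
# sort the rows by column A, breaking ties with column C
df.sort_values(['A','C'])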

|   | A | B | C |
|---|---|---|---|
| 2 | 1 | 7 | 5 |
| 4 | 3 | 6 | 2 |
| 0 | 3 | 0 | 10 |
| 1 | 6 | 0 | 4 |
| 3 | 12 | 5 | 8 |
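
The sort order can also be set per column by passing a list to `ascending`. The sketch below reuses the DataFrame from this section; the variable names are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 6, 1, 12, 3], 'B': [0, 0, 7, 5, 6], 'C': [10, 4, 5, 8, 2]})

# Ascending on A, ties broken by ascending C
by_a_then_c = df.sort_values(['A', 'C'])

# Ascending on A, ties broken by descending C
mixed = df.sort_values(['A', 'C'], ascending=[True, False])

# Sorting along axis=1 reorders the columns by their labels
columns_reversed = df.sort_index(axis=1, ascending=False)
```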


## 10. Grouping

df = pd.DataFrame({
       'A' : ['dog', 'cat', 'dog', 'cat', 'dog', 'cat', 'dog', 'dog'],
       'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
       'C' : np.random.randint(10, size=8)})

df.groupby('A')['C'].mean()

dfMusic = pd.read_csv('data/songs_requested.csv')
grouped_df = dfMusic.groupby('Musician').agg({'Name': 'count',
                                    'Decade': lambda x: str(min(x))+"-"+str(max(x)),
                                    'Requested': ['sum', 'max', 'mean']}
                                    )
# renaming some columns (rename returns a new DataFrame, grouped_df itself is unchanged)
grouped_df.rename(columns={"count":"Total",
                           "<lambda>":"span"})

| Musician | Name (Total) | Decade (span) | Requested (sum) | Requested (max) | Requested (mean) |
|----------|--------------|---------------|-----------------|-----------------|------------------|
| Bob Dylan | 3 | 60-70 | 632 | 297 | 210.666667 |
| David Bowie | 6 | 60-90 | 991 | 194 | 165.166667 |
| Led Zeppelin | 11 | 60-70 | 2583 | 435 | 234.818182 |
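
A possible alternative to renaming after the fact is pandas' named aggregation, where each output column is given its name directly in `agg`. The sketch below is not the method used in these notes: the DataFrame `songs`, its values and the output column names are all made up, since the original CSV is not available here.

```python
import pandas as pd

# Hypothetical stand-in for songs_requested.csv
songs = pd.DataFrame({
    'Musician': ['Bob Dylan', 'Bob Dylan', 'David Bowie'],
    'Name': ['Song A', 'Song B', 'Song C'],
    'Decade': [60, 70, 70],
    'Requested': [100, 50, 30],
})

# Named aggregation: output_name=(input_column, aggregation)
summary = songs.groupby('Musician').agg(
    Total=('Name', 'count'),
    span=('Decade', lambda x: str(x.min()) + '-' + str(x.max())),
    requested_sum=('Requested', 'sum'),
    requested_mean=('Requested', 'mean'),
)
```

The result has flat column names, so no `rename` step is needed afterwards.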


## 11. Handling missing values

df = pd.DataFrame(np.random.randint(10, size=(3, 3)), index=['a', 'c', 'e'], columns=['A', 'B', 'C'])
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f'])
df2.isnull().sum() # count the missing values in each column
df2.isnull().sum(axis=1) # count the missing values in each row

# drop a row if it has a missing value in any column
df2.dropna()
# drop a row only if all of its values are missing
df2.dropna(how='all')
# drop a row if it has a missing value in column 'A'
df2.dropna(subset=['A'])
# drop a row if it has a missing value in column 'A' or column 'B'
df2.dropna(subset=['A','B'])
# drop a row if it has a missing value in both column 'A' and column 'B'
df2.dropna(subset=['A','B'], how='all')

# replace the missing values (NaN) with 0
df2.fillna(value=0)

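Output of `df2.isnull().sum(axis=1)` (rows b, d and f were created by the reindex and are entirely NaN):

a    0
b    3
c    0
d    3
e    0
f    3
dtype: int64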
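
Because the DataFrame above is filled with random integers, the `dropna` variants are easier to compare on fixed data. The following sketch uses a made-up two-column DataFrame (`df` and the result names are assumptions for this example) in which row `a` is complete, row `b` is missing only `A`, and row `c` is missing everything.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': [1.0, np.nan, np.nan], 'B': [2.0, 5.0, np.nan]},
    index=['a', 'b', 'c'],
)

# Number of missing values in each row
missing_per_row = df.isnull().sum(axis=1)

# Default how='any': a single NaN is enough to drop the row
any_missing = df.dropna()

# how='all': only rows that are entirely NaN are dropped
all_missing = df.dropna(how='all')

# subset restricts the check to the listed columns
missing_in_a = df.dropna(subset=['A'])

# fillna replaces every NaN with the given value
filled = df.fillna(value=0)
```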