# *Pandas notebook*

## Needed libraries

In [1]:
import numpy as np
import pandas as pd
#library for useful os operations
import os

## Basic useful os operations

In [2]:
#os.listdir()
!ls

notebook2.ipynb		 pandas_exercises-master.zip  sklearn.ipynb
notebook.ipynb		 result.csv		      titanic
pandas_exercises-master  result.zip		      titanic.zip


In [3]:
#!ls titanic
os.listdir("titanic")

['test.csv', 'train.csv', 'gender_submission.csv']

## Introduction to DataFrames, Series and their most useful functionalities

### DataFrames

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. 

* main arguments of the DataFrame constructor:

**data** : ndarray (structured or homogeneous), Iterable, dict, or DataFrame: the data to store

**index** : Index or array-like: the row labels of the DataFrame

**columns** : Index or array-like: the column labels of the DataFrame

**dtype** : data type to force on all columns


In [4]:
pd.DataFrame(data={'nom': ["Mokhtar", "Anas"],
                   'age': [21, 17]},
             index=["id1","id2"])

Unnamed: 0,nom,age
id1,Mokhtar,21
id2,Anas,17


### Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

* main arguments of the Series constructor:

**data** : array-like, Iterable, dict, or scalar value: the data to store

**index** : Index or array-like: the index of the Series

**dtype** : data type to force

In [5]:
pd.Series(data=[1,2,3],index=["id1","id2","id3"])

id1    1
id2    2
id3    3
dtype: int64

## Useful functionalities before starting

In [6]:
X=pd.DataFrame(data={'nom': ["Mokhtar", "Anas"],
                     'age': [21, 17]},
               index=["id1","id2"])
X

Unnamed: 0,nom,age
id1,Mokhtar,21
id2,Anas,17


### head (and tail)

* head (respectively tail) function: display the first (respectively last) rows of the DataFrame

**n** : int, default 5: number of rows to display



In [7]:
X.head()
#X.tail(10)

Unnamed: 0,nom,age
id1,Mokhtar,21
id2,Anas,17


### shape

In [8]:
#shape returns the shape of the DataFrame (respectively the Series)
X.shape

(2, 2)

In [9]:
X["nom"].shape

(2,)

### copy

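copy() returns a new object holding a copy of the DataFrame (respectively Series) content, so later changes to the copy do not affect the original.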

In [10]:
test=X.copy()
test.head()

Unnamed: 0,nom,age
id1,Mokhtar,21
id2,Anas,17
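
To see why copy() matters, here is a minimal sketch on the same small frame. The names `alias` and `independent` are only illustrative: plain assignment gives a second name for the same object, while copy() gives an independent one.

```python
import pandas as pd

X = pd.DataFrame({"nom": ["Mokhtar", "Anas"], "age": [21, 17]},
                 index=["id1", "id2"])

alias = X                # second name for the same object, nothing is copied
independent = X.copy()   # new object with its own data

independent.loc["id1", "age"] = 99   # only the copy changes
alias.loc["id2", "age"] = 50         # X changes too, alias and X are one object

print(X["age"].tolist())
print(independent["age"].tolist())
```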


### assigning values

In [11]:
test=X.copy()
test["nom"]="nom"
test.head()

Unnamed: 0,nom,age
id1,nom,21
id2,nom,17


In [12]:
test["age"]=[30,17]
test.head()

Unnamed: 0,nom,age
id1,nom,30
id2,nom,17


### Deleting features

**labels** : single label or list-like: index or column labels to drop.

**axis** : {0 or ‘index’, 1 or ‘columns’}, default 0: whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).

**index** : single label or list-like: alternative to specifying axis (labels, axis=0 is equivalent to index=labels).

**columns** : single label or list-like: alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).

**level** : int or level name, optional: for MultiIndex, level from which the labels will be removed.

**inplace** : bool, default False: if True, do the operation in place and return None.


In [13]:
X.drop("age",axis=1)

Unnamed: 0,nom
id1,Mokhtar
id2,Anas


In [14]:
X.drop("id1",axis=0)

Unnamed: 0,nom,age
id2,Anas,17
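
The `index=`, `columns=` and `inplace=` arguments described above can be sketched on the same small frame. The variable names below are illustrative only.

```python
import pandas as pd

X = pd.DataFrame({"nom": ["Mokhtar", "Anas"], "age": [21, 17]},
                 index=["id1", "id2"])

no_age = X.drop(columns="age")   # same as X.drop("age", axis=1)
no_id1 = X.drop(index="id1")     # same as X.drop("id1", axis=0)

Y = X.copy()
result = Y.drop(columns=["age"], inplace=True)   # modifies Y and returns None

print(no_age.columns.tolist(), no_id1.index.tolist(), result)
```

Without inplace=True, drop returns a new DataFrame and leaves X untouched.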


## Input and Output

### Input

* read the train, test and submission files with pandas' read_csv

**filepath_or_buffer** : str, path object or file-like object

**sep** : str, default ‘,’

**names** : array-like, optional: list of column names to use if the file has no header

**index_col** : int, str, sequence of int / str, or False, default None: column(s) to use as the index

**dtype** : type name or dict of column -> type, optional: force type

**parse_dates** : bool or list of int or names or list of lists or dict, default False: columns to parse as dates
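
As a rough illustration of these arguments, the sketch below reads a small in-memory text instead of a real file. The content of `raw` and the column names are made up for the example.

```python
import io
import pandas as pd

# Hypothetical file content: ';' separated, no header row.
raw = "1;2020-01-15;7.25\n2;2020-02-01;71.28\n"

df = pd.read_csv(io.StringIO(raw),
                 sep=";",
                 names=["id", "date", "fare"],   # needed because there is no header
                 index_col="id",
                 dtype={"fare": "float64"},
                 parse_dates=["date"])

print(df.dtypes)
```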


In [15]:
X_train=pd.read_csv("titanic/train.csv"
                    ,index_col="PassengerId")
X_test=pd.read_csv("titanic/test.csv"
                   ,index_col="PassengerId")
Submission=pd.read_csv("titanic/gender_submission.csv")

In [16]:
#display the first 5 elements of train
X_train.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [17]:
#display the first n elements of test
X_test.head(2)

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S


In [18]:
#display the last n elements of Submission
Submission.tail(1)

Unnamed: 0,PassengerId,Survived
417,1309,0


### Output

* Write a DataFrame to a file with to_csv (same for Series)

**path_or_buf** : str, path object or file-like object

**sep** : str, default ‘,’

**na_rep** : str, default ‘’: missing data representation

**index** : bool, default True


In [19]:
Submission.to_csv("result.csv",index=False)
#Submission.to_csv("result.zip",index=False) for compression (zip format)
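
To see the effect of `sep`, `na_rep` and `index`, the sketch below writes to an in-memory buffer rather than to disk. The small frame is made up for the example.

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({"PassengerId": [892, 893], "Age": [34.5, np.nan]})

buffer = io.StringIO()
df.to_csv(buffer, sep=";", na_rep="NA", index=False)   # missing values written as "NA"
text = buffer.getvalue()

print(text)
```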

## Selection

### Naive accessor

In [20]:
X_train["Name"].head()

PassengerId
1                              Braund, Mr. Owen Harris
2    Cumings, Mrs. John Bradley (Florence Briggs Th...
3                               Heikkinen, Miss. Laina
4         Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                             Allen, Mr. William Henry
Name: Name, dtype: object

In [21]:
X_train.Name.head()

PassengerId
1                              Braund, Mr. Owen Harris
2    Cumings, Mrs. John Bradley (Florence Briggs Th...
3                               Heikkinen, Miss. Laina
4         Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                             Allen, Mr. William Henry
Name: Name, dtype: object

In [22]:
X_train["Name"][1]

'Braund, Mr. Owen Harris'

### Indexing

#### Position-based indexing: iloc

* row indexing

In [23]:
#scalar integer: Series
X_train.iloc[0]

Survived                          0
Pclass                            3
Name        Braund, Mr. Owen Harris
Sex                            male
Age                              22
SibSp                             1
Parch                             0
Ticket                    A/5 21171
Fare                           7.25
Cabin                           NaN
Embarked                          S
Name: 1, dtype: object

In [24]:
#With a list of integers: DataFrame
X_train.iloc[[0,2,3]]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


In [25]:
#with a boolean mask (it must have one entry per row)
mask=np.isin(np.arange(len(X_train)),[0,2])
X_train.iloc[mask]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


* both axes indexing

In [26]:
#same properties but with both axes
X_train.iloc[2:12:3, [1, 3]]

PassengerId,Pclass,Sex
3,3,female
6,3,male
9,3,female
12,1,female


#### Label-based selection: loc

loc offers the same functionality as iloc but works with the index labels and column names of the DataFrame instead of integer positions.

In [27]:
X_train.loc[1:4,["Survived","Age"]]

PassengerId,Survived,Age
1,0,22.0
2,1,38.0
3,1,26.0
4,1,35.0


* Main difference between the two when slicing:

iloc: the end of the slice is **excluded** (same as Python)

loc: the end of the slice is **included**
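
A minimal sketch of this difference, using a made-up Series whose labels start at 1 (like PassengerId), so that positions and labels do not coincide:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=[1, 2, 3, 4])

by_position = s.iloc[1:3]   # positions 1 and 2, the stop position 3 is excluded
by_label = s.loc[1:3]       # labels 1, 2 and 3, the stop label 3 is included

print(by_position.tolist())
print(by_label.tolist())
```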

### Conditional selection

In [28]:
(X_train["Sex"]=="female").head()

PassengerId
1    False
2     True
3     True
4     True
5    False
Name: Sex, dtype: bool

In [29]:
#X_train.loc[X_train["Sex"]=="female"].head()
X_train[X_train["Sex"]=="female"].head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


## Summary functions

### describe

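describe() gives a high-level summary of the given columns.

It is type-aware: numeric columns get count, mean, std, min, quartiles and max, while object (string) columns get count, unique, top and freq. On a DataFrame with mixed types, only the numeric columns are summarized by default.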

In [30]:
X_train.describe()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292
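
The type-aware behaviour can be checked on a small made-up frame with one text column and one numeric column:

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female"],
                   "Age": [22.0, 38.0, 26.0]})

numeric_summary = df["Age"].describe()   # count, mean, std, min, quartiles, max
text_summary = df["Sex"].describe()      # count, unique, top, freq

print(numeric_summary.index.tolist())
print(text_summary.index.tolist())
```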


To get a single summary statistic for each column, call the corresponding method, for example count() for the number of non-missing values:

In [31]:
X_train.count()

Survived    891
Pclass      891
Name        891
Sex         891
Age         714
SibSp       891
Parch       891
Ticket      891
Fare        891
Cabin       204
Embarked    889
dtype: int64

### unique

To see a list of unique values

In [32]:
X_train["Sex"].unique()

array(['male', 'female'], dtype=object)

### value_counts

To see a list of unique values and how often they occur in the dataset

In [33]:
X_train["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64

## Mapping

Transforming data from the format it is in now to the format that we want it to be in later

### map function

The function you pass to map() should expect a single value from the Series (one age, in the example below), and return a transformed version of that value.

map() returns a new Series where all the values have been transformed by your function.

In [34]:
X_train["Age"].mean()

29.69911764705882

In [35]:
age_mean=X_train["Age"].mean() #compute the mean once, not once per value
X_train["Age"].map(lambda x:x-age_mean).head()

PassengerId
1   -7.699118
2    8.300882
3   -3.699118
4    5.300882
5    5.300882
Name: Age, dtype: float64
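
map() also accepts a dict, which it uses as a lookup table. A small sketch on made-up data (the encoding 0/1 is an arbitrary choice for the example):

```python
import pandas as pd

sex = pd.Series(["male", "female", "female"], index=[1, 2, 3])

# A dict works as a lookup table.
encoded = sex.map({"male": 0, "female": 1})

# Values missing from the dict become NaN.
unknown = pd.Series(["other"]).map({"male": 0, "female": 1})

# A function is called once per value.
initials = sex.map(lambda s: s[0].upper())

print(encoded.tolist(), initials.tolist())
```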

### apply function

apply() is the DataFrame counterpart of **map**: it transforms a whole DataFrame by calling a custom function on each row (or on each column).

In [None]:
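#with axis="index" the function receives one column (a Series) at a time
def add_row(col):
    col = col.copy()
    col["id1"]=np.nan  #add an entry labelled "id1" to every column
    return col

X_train.apply(add_row ,axis="index").head()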

In [36]:
def log(row):
    row = row.copy()  #avoid mutating the object passed in by apply
    row["Age"] = np.log(row["Age"])
    return row

X_train.apply(log ,axis="columns").head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,3.091042,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,3.637586,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,3.258097,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,3.555348,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,3.555348,0,0,373450,8.05,,S


If we had called X_train.apply() with axis='index', then instead of passing a function to transform each row, we would need to give a function to transform each column.
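
The two axis values can be compared on a small made-up frame. With axis="index" the function sees one column at a time, with axis="columns" one row at a time.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# One result per column: the function receives each column as a Series.
col_ranges = df.apply(lambda col: col.max() - col.min(), axis="index")

# One result per row: the function receives each row as a Series.
row_sums = df.apply(lambda row: row["a"] + row["b"], axis="columns")

print(col_ranges.to_dict())
print(row_sums.tolist())
```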

## Grouping

Often we want to group our data, and then do something specific to the group the data is in.

groupby() provides efficient data grouping.

It takes one or more column names and splits the DataFrame into
chunks based on the values in those columns. It returns a DataFrameGroupBy object.

### Steps of a groupby operation:
* **Splitting** the data into groups based on some criteria.
* **Applying** a function to each group independently.
* **Combining** the results into a data structure.


### applied operations

**Aggregation**: compute a summary statistic (or statistics) for each group. Some examples:

* Compute group sums or means.
* Compute group sizes / counts.

**Transformation**: perform some group-specific computations and return a like-indexed object. Some examples:

* Standardize data within a group.
* Filling NAs within groups with a value derived from each group.

**Filtration**: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:

* Discard data that belongs to groups with only a few members.
* Filter out data based on the group sum or mean.
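
The three kinds of operation above can be sketched on a small made-up frame. The column names `group` and `value` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "b", "b", "c"],
                   "value": [1.0, 3.0, 2.0, 4.0, 6.0, 10.0]})
grouped = df.groupby("group")["value"]

# Aggregation: one value per group.
means = grouped.mean()

# Transformation: result has the same index as df.
centered = grouped.transform(lambda x: x - x.mean())

# Filtration: keep only the rows of groups with at least 2 members.
large_groups = df.groupby("group").filter(lambda g: len(g) >= 2)

print(means.to_dict())
print(centered.tolist())
print(large_groups.index.tolist())
```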




### Groupwise analysis

value_counts() is equivalent to a groupby operation followed by a count (only the ordering of the result can differ):

In [37]:
X_train.groupby("Survived")["Survived"].count()

Survived
0    549
1    342
Name: Survived, dtype: int64

In [38]:
#the grouping column "Survived" becomes the index, so it no longer appears as a column
X_train.groupby("Survived").count()

Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,549,549,549,424,549,549,549,549,68,549
1,342,342,342,290,342,342,342,342,136,340


We can use ***groupby*** and ***agg*** to apply many different functions

In [39]:
X_train.groupby("Survived")["Age"].agg(["max","min","mean","median"])

Survived,max,min,mean,median
0,74.0,1.0,30.626179,28.0
1,80.0,0.42,28.34369,28.0


In [40]:
X_train.groupby("Survived")["Age"].filter(lambda x:x.mean()>29).head(10)

PassengerId
1     22.0
5     35.0
6      NaN
7     54.0
8      2.0
13    20.0
14    39.0
15    14.0
17     2.0
19    31.0
Name: Age, dtype: float64

In [41]:
X_train.groupby("Survived")["Age"].transform(lambda x:x-x.mean()).head(10)

PassengerId
1     -8.626179
2      9.656310
3     -2.343690
4      6.656310
5      4.373821
6           NaN
7     23.373821
8    -28.626179
9     -1.343690
10   -14.343690
Name: Age, dtype: float64

### Multi-indexes group by

Depending on the operation we run, groupby() will sometimes return a result with what is called a multi-index, for example when grouping by several columns.

In [42]:
#numeric_only=True skips the text columns (required in recent pandas versions)
X_train.groupby(["Survived","Sex"]).mean(numeric_only=True)

Survived,Sex,Pclass,Age,SibSp,Parch,Fare
0,female,2.851852,25.046875,1.209877,1.037037,23.024385
0,male,2.476496,31.618056,0.440171,0.207265,21.960993
1,female,1.918455,28.847716,0.515021,0.515021,51.938573
1,male,2.018349,27.276022,0.385321,0.357798,40.821484
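
A short sketch of how a multi-indexed result can be used, on made-up data with the same column names: selecting with a tuple of labels, selecting on the first level only, and flattening back to ordinary columns with reset_index().

```python
import pandas as pd

df = pd.DataFrame({"Survived": [0, 0, 1, 1],
                   "Sex": ["male", "female", "male", "female"],
                   "Fare": [8.0, 10.0, 30.0, 50.0]})

by_both = df.groupby(["Survived", "Sex"])["Fare"].mean()

one_cell = by_both.loc[(1, "female")]   # select with a tuple of labels
survivors = by_both.loc[1]              # select on the first level only
flat = by_both.reset_index()            # index levels become ordinary columns

print(one_cell)
print(flat.columns.tolist())
```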


- to read more about grouping: [Group By: split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)
- for advanced multi-indexing: [pandas: MultiIndex / advanced indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)

## Sorting 

To get the data in the order we want, we can sort it ourselves.

### Sort values

Main arguments

**by** : str or list of str: column(s) to sort by

**axis** : {0 or ‘index’, 1 or ‘columns’}, default 0

**ascending** : bool or list of bool, default True

**inplace** : bool, default False: change the DataFrame or return DataFrame object

**kind** : {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’

**na_position** : {‘first’, ‘last’}, default ‘last’

In [43]:
X_train.sort_values(by="Survived",ascending=False).head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
446,1,1,"Dodge, Master. Washington",male,4.0,0,2,33638,81.8583,A34,S
320,1,1,"Spedden, Mrs. Frederic Oakley (Margaretta Corn...",female,40.0,1,1,16966,134.5,E34,C
335,1,1,"Frauenthal, Mrs. Henry William (Clara Heinshei...",female,,1,0,PC 17611,133.65,,S
331,1,3,"McCoy, Miss. Agnes",female,,2,0,367226,23.25,,Q
330,1,1,"Hippach, Miss. Jean Gertrude",female,16.0,0,1,111361,57.9792,B18,C
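
Sorting by several columns and controlling where missing values go can be sketched on a small made-up frame (the labels p1 to p4 are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Pclass": [3, 1, 1, 2],
                   "Age": [22.0, np.nan, 38.0, 14.0]},
                  index=["p1", "p2", "p3", "p4"])

# Pclass ascending, then Age descending within each class (NaN stays last).
ordered = df.sort_values(by=["Pclass", "Age"], ascending=[True, False])

# Single column, missing ages placed first.
nan_first = df.sort_values(by="Age", na_position="first")

print(ordered.index.tolist())
print(nan_first.index.tolist())
```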


## Combining

When performing operations on a dataset, we will sometimes need to combine different DataFrames and/or Series in non-trivial ways. Pandas has three core methods for doing this. In order of increasing complexity, these are concat(), join(), and merge().

Most of what merge() can do can also be done more simply with join(). So we are going to work mainly on concat() and join()

### concat()

Given a list of elements, this function will smush those elements together along an *axis*.

**objs** : a sequence or mapping of Series or DataFrame objects

**axis** : {0/’index’, 1/’columns’}, default 0

**join** : {‘inner’, ‘outer’}, default ‘outer’: how to handle the labels on the other axis (or axes).

    outer: keep the union of the labels found in all objects, filling missing entries with NaN.
    inner: keep only the labels shared by all objects.


**ignore_index** : bool, default False: ignore the existing indexes while concatenating and number the result 0, ..., n - 1

**sort** : bool: sort the non-concatenation axis when it is not already aligned (older pandas versions sorted by default and warned that this would change; recent versions default to False)


In [44]:
pd.concat([X_train,X_test],axis=0,join="inner").head(10)
#no Survived column: only the columns shared by both DataFrames are kept

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [45]:
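pd.concat([X_train,X_test],axis=0,join="outer",sort=True).head(10)
#Survived is kept; the test rows, which have no Survived value, get NaN
#sort=True orders the columns alphabetically, as in the output below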


PassengerId,Age,Cabin,Embarked,Fare,Name,Parch,Pclass,Sex,SibSp,Survived,Ticket
1,22.0,,S,7.25,"Braund, Mr. Owen Harris",0,3,male,1,0.0,A/5 21171
2,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,1,female,1,1.0,PC 17599
3,26.0,,S,7.925,"Heikkinen, Miss. Laina",0,3,female,0,1.0,STON/O2. 3101282
4,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,1,female,1,1.0,113803
5,35.0,,S,8.05,"Allen, Mr. William Henry",0,3,male,0,0.0,373450
6,,,Q,8.4583,"Moran, Mr. James",0,3,male,0,0.0,330877
7,54.0,E46,S,51.8625,"McCarthy, Mr. Timothy J",0,1,male,0,0.0,17463
8,2.0,,S,21.075,"Palsson, Master. Gosta Leonard",1,3,male,3,0.0,349909
9,27.0,,S,11.1333,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",2,3,female,0,1.0,347742
10,14.0,,C,30.0708,"Nasser, Mrs. Nicholas (Adele Achem)",0,2,female,1,1.0,237736


In [46]:
print("X_train.shape: ",X_train.shape)
print("X_test.shape: ",X_test.shape)
print("concat shape: ",pd.concat([X_train,X_test],axis=0,join="outer").shape)

X_train.shape:  (891, 11)
X_test.shape:  (418, 10)
concat shape:  (1309, 11)
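
The remaining concat() arguments can be sketched on two small made-up frames, `a` and `b`, which share the column `x` and the index labels 0 and 1:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
b = pd.DataFrame({"x": [3, 4], "y": [5, 6]}, index=[0, 1])

stacked = pd.concat([a, b], axis=0)                         # index labels are repeated
renumbered = pd.concat([a, b], axis=0, ignore_index=True)   # fresh index 0..3
common = pd.concat([a, b], axis=0, join="inner")            # only the shared column "x"
side_by_side = pd.concat([a, b], axis=1)                    # columns added, rows aligned on the index

print(stacked.index.tolist(), renumbered.index.tolist())
print(common.columns.tolist(), side_by_side.shape)
```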



### join()

join() lets you combine different DataFrame objects which have an index in common


**other** : DataFrame, Series, or list of DataFrame: the other Dataframe to join

**on** : str, list of str, or array-like, optional: Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. 

**how** : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’

    How to handle the operation of the two objects.

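        left: use calling frame’s index (or column if on is specified)
        right: use other’s index.
        outer: form the union of calling frame’s index (or column if on is specified) with other’s index.
        inner: form the intersection of calling frame’s index (or column if on is specified) with other’s index.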
       
**lsuffix** : str, default ‘’: Suffix to use from left frame’s overlapping columns.

**rsuffix** : str, default ‘’: Suffix to use from right frame’s overlapping columns.

**sort** : bool, default False:    Order result DataFrame lexicographically by the join key.
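
The `on` argument is the least obvious one: it matches a column of the caller against the index of the other frame. A minimal sketch with two made-up frames (`passengers` and `classes` are hypothetical names):

```python
import pandas as pd

passengers = pd.DataFrame({"name": ["A", "B", "C"], "Pclass": [3, 1, 3]})
classes = pd.DataFrame({"label": ["first", "second", "third"]}, index=[1, 2, 3])

# Match passengers["Pclass"] against the index of classes.
labelled = passengers.join(classes, on="Pclass")

print(labelled["label"].tolist())
```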


In [47]:
X1=X_train[["Survived","Fare"]]
X1.head()

PassengerId,Survived,Fare
1,0,7.25
2,1,71.2833
3,1,7.925
4,1,53.1
5,0,8.05


In [48]:
X2=X_train[["Survived","Age"]]
X2.head()

PassengerId,Survived,Age
1,0,22.0
2,1,38.0
3,1,26.0
4,1,35.0
5,0,35.0


In [49]:
X1.join(X2,lsuffix='_1', rsuffix='_2').head()

PassengerId,Survived_1,Fare,Survived_2,Age
1,0,7.25,0,22.0
2,1,71.2833,1,38.0
3,1,7.925,1,26.0
4,1,53.1,1,35.0
5,0,8.05,0,35.0


In [50]:
x1=X1.reset_index()
x2=X2.reset_index().head(200)
x1.drop("PassengerId",axis=1,inplace=True)
x2.drop("PassengerId",axis=1,inplace=True)

In [51]:
x1.head()

Unnamed: 0,Survived,Fare
0,0,7.25
1,1,71.2833
2,1,7.925
3,1,53.1
4,0,8.05


In [52]:
x2.head()

Unnamed: 0,Survived,Age
0,0,22.0
1,1,38.0
2,1,26.0
3,1,35.0
4,0,35.0


In [53]:
inner=x1.join(x2,lsuffix='_1', rsuffix='_2',how="inner")
inner.head()

Unnamed: 0,Survived_1,Fare,Survived_2,Age
0,0,7.25,0,22.0
1,1,71.2833,1,38.0
2,1,7.925,1,26.0
3,1,53.1,1,35.0
4,0,8.05,0,35.0


In [54]:
outer=x1.join(x2,lsuffix='_1', rsuffix='_2',how="outer")
outer.head()


Unnamed: 0,Survived_1,Fare,Survived_2,Age
0,0,7.25,0.0,22.0
1,1,71.2833,1.0,38.0
2,1,7.925,1.0,26.0
3,1,53.1,1.0,35.0
4,0,8.05,0.0,35.0


In [55]:
print("x1 shape: ",x1.shape)
print("x2 shape: ",x2.shape)
print("inner join shape: ",inner.shape)
print("outer join shape: ",outer.shape)


x1 shape:  (891, 2)
x2 shape:  (200, 2)
inner join shape:  (200, 4)
outer join shape:  (891, 4)
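
For comparison, the same index-on-index join can be written with merge(). This is a sketch on small made-up frames, not a full tour of merge():

```python
import pandas as pd

left = pd.DataFrame({"Survived": [0, 1, 1], "Fare": [7.25, 71.28, 7.92]})
right = pd.DataFrame({"Survived": [0, 1], "Age": [22.0, 38.0]})

joined = left.join(right, lsuffix="_1", rsuffix="_2", how="inner")

# merge() needs to be told explicitly to use both indexes as keys.
merged = left.merge(right, left_index=True, right_index=True,
                    how="inner", suffixes=("_1", "_2"))

print(merged.columns.tolist())
print(joined.equals(merged))
```

join() is the shorter spelling when the key is the index; merge() becomes useful when the keys are ordinary columns on both sides.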


For more about **concat()**, **merge()** and **join()**: [Merge, join, and concatenate](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

## More resources

[pandas exercises](https://github.com/guipsamora/pandas_exercises): full credits to [guipsamora](https://github.com/guipsamora)