## Load data manually with NumPy

In [1]:
import numpy as np

In [3]:
X = []

# read the file line by line; the with-block closes it when done
with open('data/data_2d.csv') as f:
    for line in f:
        row = line.split(',')
        sample = list(map(float, row))
        X.append(sample)

X = np.array(X)
print(X.shape)

(100, 3)
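
For comparison, NumPy can parse a simple all-numeric CSV on its own with `np.loadtxt`. The sketch below assumes a small made-up sample (`csv_text`) in the same layout as `data/data_2d.csv`, held in memory so it runs without the file, and checks that the manual loop and `np.loadtxt` produce the same array.

```python
import io
import numpy as np

# Hypothetical three-row sample standing in for data/data_2d.csv
csv_text = "1.0,2.0,3.0\n4.0,5.0,6.0\n7.0,8.0,9.0\n"

# Manual parsing: split each line on commas and convert every field to float
rows = [list(map(float, line.split(','))) for line in io.StringIO(csv_text)]
X_manual = np.array(rows)

# np.loadtxt performs the same parsing in a single call
X_loadtxt = np.loadtxt(io.StringIO(csv_text), delimiter=',')

print(X_manual.shape)
print(np.array_equal(X_manual, X_loadtxt))
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.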


## Load data using pandas

Let's use pandas to read the same CSV file in one line of code!

In [4]:
import pandas as pd

In [5]:
X = pd.read_csv('data/data_2d.csv', header=None)
X.shape

(100, 3)
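
The `header=None` argument matters here because the file has no header row. As a minimal sketch, assuming a made-up in-memory sample (`csv_text`) rather than the real file, the following shows what happens with and without it:

```python
import io
import pandas as pd

# Hypothetical three-row sample with no header line
csv_text = "1.0,2.0,3.0\n4.0,5.0,6.0\n7.0,8.0,9.0\n"

# header=None: every line is data, and columns are labeled 0, 1, 2
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# Default behavior: the first line is consumed as the column names
with_header = pd.read_csv(io.StringIO(csv_text))

print(no_header.shape)    # (3, 3)
print(with_header.shape)  # (2, 3)
```

Without `header=None`, the first data row is silently lost to the column names, so the row count is off by one.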

In [6]:
type(X)

pandas.core.frame.DataFrame

In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
0    100 non-null float64
1    100 non-null float64
2    100 non-null float64
dtypes: float64(3)
memory usage: 2.4 KB


In [8]:
X.head(5)

|  | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 17.930201 | 94.520592 | 320.25953 |
| 1 | 97.144697 | 69.593282 | 404.634472 |
| 2 | 81.775901 | 5.737648 | 181.485108 |
| 3 | 55.854342 | 70.325902 | 321.773638 |
| 4 | 49.36655 | 75.11404 | 322.465486 |


## Selecting rows and columns in dataframe

You cannot index a DataFrame the same way as a NumPy array; trying to do so raises the following error:

In [9]:
X[0,0]

KeyError: (0, 0)

In [10]:
# X.as_matrix() is deprecated (removed in pandas 1.0); use X.values or X.to_numpy()
M = X.values
type(M)


numpy.ndarray

In [12]:
type(X[0])

pandas.core.series.Series

`X[0]` returns the entire column labeled 0, which is a pandas Series.

To get the values of the first row, use `iloc`:

In [13]:
X.iloc[0]

0     17.930201
1     94.520592
2    320.259530
Name: 0, dtype: float64
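
To make the contrast with NumPy indexing concrete, here is a minimal sketch on a small made-up DataFrame (the values are illustrative, not taken from `data_2d.csv`). It shows the DataFrame equivalents of `X[0, 0]`: `iloc` indexes by position and `loc` indexes by label.

```python
import pandas as pd

# Hypothetical 2x3 frame with default integer row and column labels
X = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

first_by_position = X.iloc[0, 0]  # row position 0, column position 0
first_by_label = X.loc[0, 0]      # row label 0, column label 0
second_row = X.iloc[1]            # the whole second row, as a Series
as_array = X.to_numpy()           # plain ndarray, so as_array[1, 2] works

print(first_by_position, first_by_label, as_array[1, 2])
```

Position and label coincide here only because the frame uses the default integer labels; after filtering or re-indexing they can differ.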

In [16]:
# select the first and third columns (labels 0 and 2)
X[[0,2]].head()

|  | 0 | 2 |
|---|---|---|
| 0 | 17.930201 | 320.25953 |
| 1 | 97.144697 | 404.634472 |
| 2 | 81.775901 | 181.485108 |
| 3 | 55.854342 | 321.773638 |
| 4 | 49.36655 | 322.465486 |


In [17]:
# filter rows based on the value of one column
X[X[0] < 5]

|  | 0 | 1 | 2 |
|---|---|---|---|
| 5 | 3.192702 | 29.256299 | 94.618811 |
| 44 | 3.593966 | 96.252217 | 293.237183 |
| 54 | 4.593463 | 46.335932 | 145.818745 |
| 90 | 1.382983 | 84.944087 | 252.905653 |
| 99 | 4.142669 | 52.254726 | 168.034401 |
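
The expression inside the brackets is a boolean Series with one entry per row, and such masks can be combined. The sketch below uses a small made-up frame to show filtering on two conditions; note that each condition needs its own parentheses, and that `&` and `|` are used instead of `and` and `or`.

```python
import pandas as pd

# Hypothetical data with two columns labeled 0 and 1
X = pd.DataFrame({0: [1.0, 6.0, 3.0, 9.0], 1: [50.0, 10.0, 5.0, 80.0]})

mask = X[0] < 5                        # boolean Series: True where column 0 < 5
both = X[(X[0] < 5) & (X[1] > 20)]     # rows satisfying both conditions
either = X[(X[0] < 5) | (X[1] > 20)]   # rows satisfying at least one

print(mask.tolist())
print(list(both.index), list(either.index))
```

As in the output above, the filtered frames keep the original row labels rather than renumbering from 0.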


### Column names

In [19]:
df = pd.read_csv('data/airline.csv', engine='python', skipfooter=3)

In [20]:
df.columns

Index(['Month', 'International airline passengers: monthly totals in thousands. Jan 49 ? Dec 60'], dtype='object')

The column names are messy, so let's replace them with shorter ones.

In [21]:
df.columns = ['month', 'passengers']
df.columns

Index(['month', 'passengers'], dtype='object')
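
Assigning to `df.columns` replaces every name at once and depends on column order. An alternative is `DataFrame.rename`, which maps old names to new ones explicitly. This is a minimal sketch on a made-up two-row frame; the long header used here is a shortened placeholder, not the real one from `airline.csv`.

```python
import pandas as pd

# Hypothetical frame with an unwieldy column name
df = pd.DataFrame({
    'Month': ['1949-01', '1949-02'],
    'Passenger totals (long header)': [112, 118],
})

# rename returns a new DataFrame; the original is left unchanged
renamed = df.rename(columns={
    'Month': 'month',
    'Passenger totals (long header)': 'passengers',
})

print(list(renamed.columns))
```

Names that do not appear in the mapping are kept as they are, which makes `rename` convenient when only a few columns need changing.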

In [22]:
# access the values of one column
df.month.head()
# equivalent: df['month'].head()

0    1949-01
1    1949-02
2    1949-03
3    1949-04
4    1949-05
Name: month, dtype: object

In [23]:
df['new_col'] = 1
df.head()

|  | month | passengers | new_col |
|---|---|---|---|
| 0 | 1949-01 | 112 | 1 |
| 1 | 1949-02 | 118 | 1 |
| 2 | 1949-03 | 132 | 1 |
| 3 | 1949-04 | 129 | 1 |
| 4 | 1949-05 | 121 | 1 |


What if we want to **assign a new column value based on the values of other column(s)**? Actually, this is very common in the feature engineering process of a data science project!

Let's use the pandas `apply()` method for this.

In [24]:
from datetime import datetime

In [25]:
# axis=1 means apply the function to each row in the dataframe
df['dt'] = df.apply(lambda row: datetime.strptime(row['month'], '%Y-%m'), axis=1)
df.head()

|  | month | passengers | new_col | dt |
|---|---|---|---|---|
| 0 | 1949-01 | 112 | 1 | 1949-01-01 |
| 1 | 1949-02 | 118 | 1 | 1949-02-01 |
| 2 | 1949-03 | 132 | 1 | 1949-03-01 |
| 3 | 1949-04 | 129 | 1 | 1949-04-01 |
| 4 | 1949-05 | 121 | 1 | 1949-05-01 |
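
`apply()` calls the Python function once per row, which is flexible but can be slow on large frames. For date parsing specifically, `pd.to_datetime` converts a whole column in one call. The sketch below uses a small made-up frame and, for brevity, applies the function to the `month` Series rather than row-wise with `axis=1`; the column names `dt_apply` and `dt_vectorized` are only illustrative.

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample in the same shape as the airline data
df = pd.DataFrame({
    'month': ['1949-01', '1949-02', '1949-03'],
    'passengers': [112, 118, 132],
})

# Element-wise: one strptime call per value
df['dt_apply'] = df['month'].apply(lambda s: datetime.strptime(s, '%Y-%m'))

# Vectorized: parse the whole column at once
df['dt_vectorized'] = pd.to_datetime(df['month'], format='%Y-%m')

print(df)
```

`apply()` remains the general tool when the new column depends on arbitrary logic over several columns; the vectorized call is worth reaching for when pandas already provides one.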


### Joining tables

In [26]:
t1 = pd.read_csv('data/table1.csv')
t2 = pd.read_csv('data/table2.csv')

In [27]:
t1.head()

|  | user_id | email | age |
|---|---|---|---|
| 0 | 1 | alice@gmail.com | 20 |
| 1 | 2 | bob@gmail.com | 25 |
| 2 | 3 | carol@gmail.com | 30 |


In [28]:
t2.head()

|  | user_id | ad_id | click |
|---|---|---|---|
| 0 | 1 | 1 | 1 |
| 1 | 1 | 2 | 0 |
| 2 | 1 | 5 | 0 |
| 3 | 2 | 3 | 0 |
| 4 | 2 | 4 | 1 |


In [29]:
merge_table = pd.merge(t1, t2, on='user_id')
# t1.merge(t2, on='user_id')
merge_table

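|  | user_id | email | age | ad_id | click |
|---|---|---|---|---|---|
| 0 | 1 | alice@gmail.com | 20 | 1 | 1 |
| 1 | 1 | alice@gmail.com | 20 | 2 | 0 |
| 2 | 1 | alice@gmail.com | 20 | 5 | 0 |
| 3 | 2 | bob@gmail.com | 25 | 3 | 0 |
| 4 | 2 | bob@gmail.com | 25 | 4 | 1 |
| 5 | 2 | bob@gmail.com | 25 | 1 | 0 |
| 6 | 3 | carol@gmail.com | 30 | 2 | 0 |
| 7 | 3 | carol@gmail.com | 30 | 1 | 0 |
| 8 | 3 | carol@gmail.com | 30 | 3 | 0 |
| 9 | 3 | carol@gmail.com | 30 | 4 | 0 |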
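
By default `pd.merge` performs an inner join, so only `user_id` values present in both tables survive. The sketch below uses two small made-up tables (not the contents of `table1.csv` and `table2.csv`) to contrast the default with `how='left'`, which keeps every row of the left table.

```python
import pandas as pd

# Hypothetical tables: user 3 has no clicks, and user 4 has no profile
t1 = pd.DataFrame({'user_id': [1, 2, 3], 'age': [20, 25, 30]})
t2 = pd.DataFrame({'user_id': [1, 1, 2, 4], 'click': [1, 0, 1, 0]})

# Inner join (default): only user_ids found in both tables
inner = pd.merge(t1, t2, on='user_id')

# Left join: every t1 row is kept; missing t2 values become NaN
left = pd.merge(t1, t2, on='user_id', how='left')

print(len(inner), len(left))
print(left)
```

In the left join, user 3 appears once with `click` set to `NaN`, while user 4 is dropped in both results because it has no row in `t1`.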
