## Pandas Works Much Like an Excel Spreadsheet

<b>What Is Pandas In Python?</b>
<br>
Pandas is an open-source Python package widely used for data science, data analysis, and machine learning tasks. It is built on top of another package, NumPy, which provides support for multi-dimensional arrays. As one of the most popular data wrangling packages, Pandas works well with many other data science modules in the Python ecosystem and is bundled with many Python distributions aimed at data work, including commercial vendor distributions like ActiveState’s ActivePython. It is not part of the Python standard library, so a plain Python install needs it added separately (for example with pip).
<br><br>
<b>What Can You Do With DataFrames Using Pandas?</b>
<br>
Pandas simplifies many of the time-consuming, repetitive tasks associated with working with data, including:

- Data cleansing
- Data fill
- Data normalization
- Merges and joins
- Data visualization
- Statistical analysis
- Data inspection
- Loading and saving data
- And much more

This breadth is a large part of why many data scientists regard Pandas as one of the best data analysis and manipulation tools available.

In [103]:
import pandas as pd

import numpy as np

In [8]:
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
print(type(iris))
iris

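# read_csv returns a DataFrame: a 2D table with rows and columns
# by default the first line is treated as the header; iris.data has no header line,
# so its first record (5.1, 3.5, 1.4, 0.2, Iris-setosa) ends up as the column names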

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,5.1,3.5,1.4,0.2,Iris-setosa
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa
...,...,...,...,...,...
144,6.7,3.0,5.2,2.3,Iris-virginica
145,6.3,2.5,5.0,1.9,Iris-virginica
146,6.5,3.0,5.2,2.0,Iris-virginica
147,6.2,3.4,5.4,2.3,Iris-virginica
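
Because `iris.data` has no header line, the default `read_csv` call consumes the first record as column names, which is why `df.shape` later reports 149 rows. The following is a minimal sketch of one way to avoid that, assuming a tiny in-memory sample in the same layout instead of the real URL; the column names passed to `names=` are the ones this notebook assigns later.

```python
import io

import pandas as pd

# Three records in the same layout as iris.data (no header line).
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# Default behaviour: the first record is consumed as the header.
default_df = pd.read_csv(io.StringIO(raw))

# header=None keeps every record as data; names= supplies the column labels.
named_df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["sl", "sw", "pl", "pw", "flower"],
)

print(default_df.shape)  # (2, 5)
print(named_df.shape)    # (3, 5)
```

The rest of this notebook keeps the default call, so its outputs show 149 rows.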


In [9]:
df = iris.copy()
# iris.copy() makes an independent copy: changes to df are not reflected in iris.


# with plain df = iris, both names refer to the same object, so changes made through df also show up in iris.
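
The difference between copying and plain assignment can be checked on a small frame. This is only an illustrative sketch; the frame and the variable names `original`, `alias`, and `independent` are made up for the example.

```python
import pandas as pd

original = pd.DataFrame({"x": [1, 2, 3]})

alias = original               # same object under a second name
independent = original.copy()  # separate object with its own data

alias.loc[0, "x"] = 100        # visible through `original`
independent.loc[1, "x"] = -1   # not visible through `original`

print(alias is original)          # True
print(original["x"].tolist())     # [100, 2, 3]
print(independent["x"].tolist())  # [1, -1, 3]
```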

In [10]:
df.head() # by default, shows the first 5 rows

Unnamed: 0,5.1,3.5,1.4,0.2,Iris-setosa
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa


In [11]:
df.head(2)

Unnamed: 0,5.1,3.5,1.4,0.2,Iris-setosa
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa


In [13]:
df.shape

(149, 5)

In [14]:
df.dtypes

5.1            float64
3.5            float64
1.4            float64
0.2            float64
Iris-setosa     object
dtype: object

In [17]:
# change the column headers as they aren't correct

print(df.columns) # prev headers
# let's change
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']
print(df.columns) # new header

Index(['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'], dtype='object')
Index(['sl', 'sw', 'pl', 'pw', 'flower'], dtype='object')


In [19]:
df.dtypes

sl        float64
sw        float64
pl        float64
pw        float64
flower     object
dtype: object

In [20]:
df.head(3)

Unnamed: 0,sl,sw,pl,pw,flower
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa


In [21]:
df.describe()

Unnamed: 0,sl,sw,pl,pw
count,149.0,149.0,149.0,149.0
mean,5.848322,3.051007,3.774497,1.205369
std,0.828594,0.433499,1.759651,0.761292
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.4,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [23]:
df.sl # OR df['sl']

0      4.9
1      4.7
2      4.6
3      5.0
4      5.4
      ... 
144    6.7
145    6.3
146    6.5
147    6.2
148    5.9
Name: sl, Length: 149, dtype: float64

In [25]:
df.isnull()

Unnamed: 0,sl,sw,pl,pw,flower
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
144,False,False,False,False,False
145,False,False,False,False,False
146,False,False,False,False,False
147,False,False,False,False,False


In [27]:
df.isnull().sum() # no null entries

sl        0
sw        0
pl        0
pw        0
flower    0
dtype: int64

In [30]:
df.iloc[1:4, 2:5] # rows 1-3 and columns 2-4 by position (end is exclusive), similar to slicing 2D lists

Unnamed: 0,pl,pw,flower
1,1.3,0.2,Iris-setosa
2,1.5,0.2,Iris-setosa
3,1.4,0.2,Iris-setosa


In [35]:
# Let's remove the 0th labelled row

a=df.drop(0) # returns a new DataFrame; df itself is unchanged
a.head()
# the row labelled 0 is gone

Unnamed: 0,sl,sw,pl,pw,flower
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa
5,4.6,3.4,1.4,0.3,Iris-setosa


In [40]:
df = iris.copy()
df.drop(1, inplace=True) # actual changes made in df
df.head()

Unnamed: 0,5.1,3.5,1.4,0.2,Iris-setosa
0,4.9,3.0,1.4,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa
5,4.6,3.4,1.4,0.3,Iris-setosa


<i>The drop function takes an index label, not a position.</i>

In [42]:
df.index # no label 1 present

Int64Index([  0,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            139, 140, 141, 142, 143, 144, 145, 146, 147, 148],
           dtype='int64', length=148)

In [47]:
# but, if we want to drop by position...
df=iris.copy()
df.drop(df.index[0], inplace=True)
df.head()

Unnamed: 0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa
5,4.6,3.4,1.4,0.3,Iris-setosa


In [48]:
df=iris.copy()
df.drop(df.index[0], inplace=True)
df.drop(df.index[0], inplace=True)
df.head()

Unnamed: 0,5.1,3.5,1.4,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa
5,4.6,3.4,1.4,0.3,Iris-setosa
6,5.0,3.4,1.5,0.2,Iris-setosa


In [55]:
df=iris.copy()
df.drop(df.index[[0, 3, 1]], inplace=True)
df.head()

Unnamed: 0,5.1,3.5,1.4,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa
5,4.6,3.4,1.4,0.3,Iris-setosa
6,5.0,3.4,1.5,0.2,Iris-setosa
7,4.4,2.9,1.4,0.2,Iris-setosa


In [60]:
df=iris.copy()
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']
df.sl>5 # if sepal length > 5

0      False
1      False
2      False
3      False
4       True
       ...  
144     True
145     True
146     True
147     True
148     True
Name: sl, Length: 149, dtype: bool

In [61]:
df=iris.copy()
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']
df[df.sl>5] # prints only those rows where condition is true

Unnamed: 0,sl,sw,pl,pw,flower
4,5.4,3.9,1.7,0.4,Iris-setosa
9,5.4,3.7,1.5,0.2,Iris-setosa
13,5.8,4.0,1.2,0.2,Iris-setosa
14,5.7,4.4,1.5,0.4,Iris-setosa
15,5.4,3.9,1.3,0.4,Iris-setosa
...,...,...,...,...,...
144,6.7,3.0,5.2,2.3,Iris-virginica
145,6.3,2.5,5.0,1.9,Iris-virginica
146,6.5,3.0,5.2,2.0,Iris-virginica
147,6.2,3.4,5.4,2.3,Iris-virginica
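
Boolean masks like `df.sl>5` can also be combined. The sketch below uses a small made-up frame (the four rows are illustrative, not the full dataset) to show the element-wise operators `&` and `|`; each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

sample = pd.DataFrame({
    "sl": [4.9, 5.4, 6.7, 6.3],
    "pw": [0.2, 0.4, 2.3, 1.9],
    "flower": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

# & is element-wise "and", | is element-wise "or".
long_virginica = sample[(sample.sl > 5) & (sample.flower == "Iris-virginica")]
short_or_setosa = sample[(sample.sl <= 5) | (sample.flower == "Iris-setosa")]

print(long_virginica.index.tolist())   # [2, 3]
print(short_or_setosa.index.tolist())  # [0, 1]
```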


In [64]:
df=iris.copy()
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']
df[df.flower=='Iris-virginica']

Unnamed: 0,sl,sw,pl,pw,flower
99,6.3,3.3,6.0,2.5,Iris-virginica
100,5.8,2.7,5.1,1.9,Iris-virginica
101,7.1,3.0,5.9,2.1,Iris-virginica
102,6.3,2.9,5.6,1.8,Iris-virginica
103,6.5,3.0,5.8,2.2,Iris-virginica
104,7.6,3.0,6.6,2.1,Iris-virginica
105,4.9,2.5,4.5,1.7,Iris-virginica
106,7.3,2.9,6.3,1.8,Iris-virginica
107,6.7,2.5,5.8,1.8,Iris-virginica
108,7.2,3.6,6.1,2.5,Iris-virginica


In [65]:
df=iris.copy()
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']
df[df.flower=='Iris-virginica'].describe() # summary statistics for the 'Iris-virginica' rows only

Unnamed: 0,sl,sw,pl,pw
count,50.0,50.0,50.0,50.0
mean,6.588,2.974,5.552,2.026
std,0.63588,0.322497,0.551895,0.27465
min,4.9,2.2,4.5,1.4
25%,6.225,2.8,5.1,1.8
50%,6.5,3.0,5.55,2.0
75%,6.9,3.175,5.875,2.3
max,7.9,3.8,6.9,2.5


In [70]:
df=iris.copy()
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']

# 'iloc' is position-based, not label-based,
# whereas 'loc' is label-based

df.drop(0, inplace=True)
print(df.head())
print(df.iloc[0])
print(df.loc[1])

# after dropping label 0, position 0 holds label 1, so iloc[0] and loc[1] print the same row

    sl   sw   pl   pw       flower
1  4.7  3.2  1.3  0.2  Iris-setosa
2  4.6  3.1  1.5  0.2  Iris-setosa
3  5.0  3.6  1.4  0.2  Iris-setosa
4  5.4  3.9  1.7  0.4  Iris-setosa
5  4.6  3.4  1.4  0.3  Iris-setosa
sl                4.7
sw                3.2
pl                1.3
pw                0.2
flower    Iris-setosa
Name: 1, dtype: object
sl                4.7
sw                3.2
pl                1.3
pw                0.2
flower    Iris-setosa
Name: 1, dtype: object


In [74]:
# add a row

df=iris.copy()
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']

df.drop(0, inplace=True)
# we deleted label 0

# df.loc[label_name]
df.loc[0]=[1, 2, 3, 4, 'Iris-setosa']

print(df.head())
print()
print(df.tail())

# as we can see, the new row is appended at the tail, even though its label is 0

    sl   sw   pl   pw       flower
1  4.7  3.2  1.3  0.2  Iris-setosa
2  4.6  3.1  1.5  0.2  Iris-setosa
3  5.0  3.6  1.4  0.2  Iris-setosa
4  5.4  3.9  1.7  0.4  Iris-setosa
5  4.6  3.4  1.4  0.3  Iris-setosa

      sl   sw   pl   pw          flower
145  6.3  2.5  5.0  1.9  Iris-virginica
146  6.5  3.0  5.2  2.0  Iris-virginica
147  6.2  3.4  5.4  2.3  Iris-virginica
148  5.9  3.0  5.1  1.8  Iris-virginica
0    1.0  2.0  3.0  4.0     Iris-setosa


In [78]:
# add a row

df=iris.copy()
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']

df.drop(0, inplace=True)
# we deleted label 0

# df.loc[label_name]
df.loc[149]=[1, 2, 3, 4, 'Iris-setosa'] # important step: use a label that is not already in the index

print(df.head())
print()
print(df.tail())

# now the new row sits at the end with label 149

    sl   sw   pl   pw       flower
1  4.7  3.2  1.3  0.2  Iris-setosa
2  4.6  3.1  1.5  0.2  Iris-setosa
3  5.0  3.6  1.4  0.2  Iris-setosa
4  5.4  3.9  1.7  0.4  Iris-setosa
5  4.6  3.4  1.4  0.3  Iris-setosa

      sl   sw   pl   pw          flower
145  6.3  2.5  5.0  1.9  Iris-virginica
146  6.5  3.0  5.2  2.0  Iris-virginica
147  6.2  3.4  5.4  2.3  Iris-virginica
148  5.9  3.0  5.1  1.8  Iris-virginica
149  1.0  2.0  3.0  4.0     Iris-setosa


In [79]:
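# since our DataFrame's index labels are now out of sequence, we renumber them
# (reset_index() returns a new DataFrame and keeps the old labels in an 'index' column)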

df.reset_index()

Unnamed: 0,index,sl,sw,pl,pw,flower
0,1,4.7,3.2,1.3,0.2,Iris-setosa
1,2,4.6,3.1,1.5,0.2,Iris-setosa
2,3,5.0,3.6,1.4,0.2,Iris-setosa
3,4,5.4,3.9,1.7,0.4,Iris-setosa
4,5,4.6,3.4,1.4,0.3,Iris-setosa
...,...,...,...,...,...,...
144,145,6.3,2.5,5.0,1.9,Iris-virginica
145,146,6.5,3.0,5.2,2.0,Iris-virginica
146,147,6.2,3.4,5.4,2.3,Iris-virginica
147,148,5.9,3.0,5.1,1.8,Iris-virginica


In [93]:
# but, now, we have another column, which we do not want
# so we have to delete it

df=iris.copy()
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']

df.drop(0, inplace=True)
# we deleted label 0

# df.loc[label_name]
df.loc[149]=[1, 2, 3, 4, 'Iris-setosa'] # important step: use a label that is not already in the index

# drop=True discards the old index instead of keeping it as a column; inplace=True modifies df itself
df.reset_index(drop = True, inplace = True)

print(df.head())
print()
print(df.tail())

    sl   sw   pl   pw       flower
0  4.7  3.2  1.3  0.2  Iris-setosa
1  4.6  3.1  1.5  0.2  Iris-setosa
2  5.0  3.6  1.4  0.2  Iris-setosa
3  5.4  3.9  1.7  0.4  Iris-setosa
4  4.6  3.4  1.4  0.3  Iris-setosa

      sl   sw   pl   pw          flower
144  6.3  2.5  5.0  1.9  Iris-virginica
145  6.5  3.0  5.2  2.0  Iris-virginica
146  6.2  3.4  5.4  2.3  Iris-virginica
147  5.9  3.0  5.1  1.8  Iris-virginica
148  1.0  2.0  3.0  4.0     Iris-setosa


In [97]:
# delete column
df=iris.copy()
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']
df.drop('sl', axis = 1, inplace = True)
df.head(15)

# axis=1 means columns (axis=0, the default, means rows)

Unnamed: 0,sw,pl,pw,flower
0,3.0,1.4,0.2,Iris-setosa
1,3.2,1.3,0.2,Iris-setosa
2,3.1,1.5,0.2,Iris-setosa
3,3.6,1.4,0.2,Iris-setosa
4,3.9,1.7,0.4,Iris-setosa
5,3.4,1.4,0.3,Iris-setosa
6,3.4,1.5,0.2,Iris-setosa
7,2.9,1.4,0.2,Iris-setosa
8,3.1,1.5,0.1,Iris-setosa
9,3.7,1.5,0.2,Iris-setosa


In [98]:
df.describe()

Unnamed: 0,sw,pl,pw
count,149.0,149.0,149.0
mean,3.051007,3.774497,1.205369
std,0.433499,1.759651,0.761292
min,2.0,1.0,0.1
25%,2.8,1.6,0.3
50%,3.0,4.4,1.3
75%,3.3,5.1,1.8
max,4.4,6.9,2.5


In [99]:
# another way

del df['sw']
df.head(15)

Unnamed: 0,pl,pw,flower
0,1.4,0.2,Iris-setosa
1,1.3,0.2,Iris-setosa
2,1.5,0.2,Iris-setosa
3,1.4,0.2,Iris-setosa
4,1.7,0.4,Iris-setosa
5,1.4,0.3,Iris-setosa
6,1.5,0.2,Iris-setosa
7,1.4,0.2,Iris-setosa
8,1.5,0.1,Iris-setosa
9,1.5,0.2,Iris-setosa
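
A third spelling, shown here as a small sketch on a made-up one-row frame, is the `columns=` keyword of `drop`. It is equivalent to passing the labels with `axis=1` and accepts several labels at once; `errors="ignore"` skips labels that are not present instead of raising `KeyError`.

```python
import pandas as pd

sample = pd.DataFrame({"sl": [4.9], "sw": [3.0], "pl": [1.4], "pw": [0.2]})

# columns= is equivalent to passing the labels with axis=1.
trimmed = sample.drop(columns=["sl", "sw"])

# 'sl' is already gone; errors="ignore" makes this a no-op instead of an error.
trimmed_again = trimmed.drop(columns=["sl"], errors="ignore")

print(trimmed.columns.tolist())        # ['pl', 'pw']
print(trimmed_again.columns.tolist())  # ['pl', 'pw']
```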


In [100]:
df=iris.copy()
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']

# add a new column
# difference between petal length and petal width

df["diff_pl_pw"]=df["pl"]-df["pw"]
df.head()

Unnamed: 0,sl,sw,pl,pw,flower,diff_pl_pw
0,4.9,3.0,1.4,0.2,Iris-setosa,1.2
1,4.7,3.2,1.3,0.2,Iris-setosa,1.1
2,4.6,3.1,1.5,0.2,Iris-setosa,1.3
3,5.0,3.6,1.4,0.2,Iris-setosa,1.2
4,5.4,3.9,1.7,0.4,Iris-setosa,1.3


<b>Our DataFrame may contain missing entries, which Pandas represents as NaN (loosely, NULL).<br>We then have 2 options:<br>1. Change the data entry (fill)<br>2. Discard the data entry (drop)</b>

In [104]:
df=iris.copy()
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']
df.iloc[3:6, 1:3] = np.nan
df.head(10)

Unnamed: 0,sl,sw,pl,pw,flower
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,,,0.2,Iris-setosa
4,5.4,,,0.4,Iris-setosa
5,4.6,,,0.3,Iris-setosa
6,5.0,3.4,1.5,0.2,Iris-setosa
7,4.4,2.9,1.4,0.2,Iris-setosa
8,4.9,3.1,1.5,0.1,Iris-setosa
9,5.4,3.7,1.5,0.2,Iris-setosa


In [105]:
df.describe()

# look at the count for sw and pl: 146 instead of 149, because NaN entries are skipped

Unnamed: 0,sl,sw,pl,pw
count,149.0,146.0,146.0,149.0
mean,5.848322,3.039041,3.821233,1.205369
std,0.828594,0.428691,1.74665,0.761292
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.4,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [106]:
# drop nan entries

df.dropna(inplace=True)
df.head(10)

# removes rows 3, 4 and 5, which contain NaN

Unnamed: 0,sl,sw,pl,pw,flower
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
6,5.0,3.4,1.5,0.2,Iris-setosa
7,4.4,2.9,1.4,0.2,Iris-setosa
8,4.9,3.1,1.5,0.1,Iris-setosa
9,5.4,3.7,1.5,0.2,Iris-setosa
10,4.8,3.4,1.6,0.2,Iris-setosa
11,4.8,3.0,1.4,0.1,Iris-setosa
12,4.3,3.0,1.1,0.1,Iris-setosa


In [109]:
df.reset_index(drop=True, inplace=True)
df.head(10)

Unnamed: 0,sl,sw,pl,pw,flower
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.4,1.5,0.2,Iris-setosa
4,4.4,2.9,1.4,0.2,Iris-setosa
5,4.9,3.1,1.5,0.1,Iris-setosa
6,5.4,3.7,1.5,0.2,Iris-setosa
7,4.8,3.4,1.6,0.2,Iris-setosa
8,4.8,3.0,1.4,0.1,Iris-setosa
9,4.3,3.0,1.1,0.1,Iris-setosa


In [112]:
# Fill NaN entries

# here we fill each gap with the mean of its column
# another option is to fill with the most frequently occurring value in that column

df=iris.copy()
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']
df.iloc[3:6, 1:3] = np.nan

# assign the result back: fillna(..., inplace=True) on df.sw acts on a column
# selection and is not guaranteed to update df in recent pandas versions
df['sw'] = df['sw'].fillna(df['sw'].mean())
df['pl'] = df['pl'].fillna(df['pl'].mean())
df.head(10)

Unnamed: 0,sl,sw,pl,pw,flower
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.039041,3.821233,0.2,Iris-setosa
4,5.4,3.039041,3.821233,0.4,Iris-setosa
5,4.6,3.039041,3.821233,0.3,Iris-setosa
6,5.0,3.4,1.5,0.2,Iris-setosa
7,4.4,2.9,1.4,0.2,Iris-setosa
8,4.9,3.1,1.5,0.1,Iris-setosa
9,5.4,3.7,1.5,0.2,Iris-setosa
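
The other option mentioned in the comments, filling with the most frequently occurring value, could look like the sketch below. The six `sw` values are made up for illustration. `Series.mode()` returns a Series because several values can tie for most frequent, so the first entry is taken here.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({"sw": [3.0, 3.4, np.nan, 3.0, np.nan, 3.1]})

# mode() ignores NaN and returns the most frequent value(s), sorted.
most_common = sample["sw"].mode().iloc[0]

sample["sw"] = sample["sw"].fillna(most_common)
print(sample["sw"].tolist())  # [3.0, 3.4, 3.0, 3.0, 3.0, 3.1]
```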


In [113]:
# alternatively, fill the gaps with the mean computed from the Iris-setosa rows only

df=iris.copy()
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']
df.iloc[3:6, 1:2] = np.nan

a = df[df.flower=='Iris-setosa']
df['sw'] = df['sw'].fillna(a.sw.mean()) # assign back rather than using inplace=True on a column selection
df.head(10)

Unnamed: 0,sl,sw,pl,pw,flower
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.402174,1.4,0.2,Iris-setosa
4,5.4,3.402174,1.7,0.4,Iris-setosa
5,4.6,3.402174,1.4,0.3,Iris-setosa
6,5.0,3.4,1.5,0.2,Iris-setosa
7,4.4,2.9,1.4,0.2,Iris-setosa
8,4.9,3.1,1.5,0.1,Iris-setosa
9,5.4,3.7,1.5,0.2,Iris-setosa
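
The cell above fills every gap in `sw` with the setosa mean, which is fine here because all the missing rows are setosa. If gaps could occur in several classes, one possible generalisation (a sketch, not what this notebook does) is `groupby(...).transform("mean")`, which yields each row's own class mean. The six rows below are made up for illustration.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "flower": ["Iris-setosa", "Iris-setosa", "Iris-setosa",
               "Iris-virginica", "Iris-virginica", "Iris-virginica"],
    "sw": [3.0, np.nan, 3.5, 2.5, 3.0, np.nan],
})

# A Series aligned with `sample`, holding the mean sw of each row's class
# (NaN values are skipped when the means are computed).
class_mean = sample.groupby("flower")["sw"].transform("mean")

# fillna with a Series fills each gap with the value at the same index label.
sample["sw"] = sample["sw"].fillna(class_mean)
print(sample["sw"].tolist())  # [3.0, 3.25, 3.5, 2.5, 3.0, 2.75]
```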


#### We generally want our column data to be numeric, because calculations on strings are difficult

In [114]:
df=iris.copy()
df.columns=['sl', 'sw', 'pl', 'pw', 'flower']
df["sex"]="female"
df.head(10)

Unnamed: 0,sl,sw,pl,pw,flower,sex
0,4.9,3.0,1.4,0.2,Iris-setosa,female
1,4.7,3.2,1.3,0.2,Iris-setosa,female
2,4.6,3.1,1.5,0.2,Iris-setosa,female
3,5.0,3.6,1.4,0.2,Iris-setosa,female
4,5.4,3.9,1.7,0.4,Iris-setosa,female
5,4.6,3.4,1.4,0.3,Iris-setosa,female
6,5.0,3.4,1.5,0.2,Iris-setosa,female
7,4.4,2.9,1.4,0.2,Iris-setosa,female
8,4.9,3.1,1.5,0.1,Iris-setosa,female
9,5.4,3.7,1.5,0.2,Iris-setosa,female


In [115]:
df.iloc[1:7, 5:6]='male'
df.head(10)

Unnamed: 0,sl,sw,pl,pw,flower,sex
0,4.9,3.0,1.4,0.2,Iris-setosa,female
1,4.7,3.2,1.3,0.2,Iris-setosa,male
2,4.6,3.1,1.5,0.2,Iris-setosa,male
3,5.0,3.6,1.4,0.2,Iris-setosa,male
4,5.4,3.9,1.7,0.4,Iris-setosa,male
5,4.6,3.4,1.4,0.3,Iris-setosa,male
6,5.0,3.4,1.5,0.2,Iris-setosa,male
7,4.4,2.9,1.4,0.2,Iris-setosa,female
8,4.9,3.1,1.5,0.1,Iris-setosa,female
9,5.4,3.7,1.5,0.2,Iris-setosa,female


In [119]:
def f(s):
    if s == 'male':
        return 1
    else:
        return 0

df['gender'] = df.sex.apply(f) # f is a user defined function
df.head(15)

Unnamed: 0,sl,sw,pl,pw,flower,sex,gender
0,4.9,3.0,1.4,0.2,Iris-setosa,female,0
1,4.7,3.2,1.3,0.2,Iris-setosa,male,1
2,4.6,3.1,1.5,0.2,Iris-setosa,male,1
3,5.0,3.6,1.4,0.2,Iris-setosa,male,1
4,5.4,3.9,1.7,0.4,Iris-setosa,male,1
5,4.6,3.4,1.4,0.3,Iris-setosa,male,1
6,5.0,3.4,1.5,0.2,Iris-setosa,male,1
7,4.4,2.9,1.4,0.2,Iris-setosa,female,0
8,4.9,3.1,1.5,0.1,Iris-setosa,female,0
9,5.4,3.7,1.5,0.2,Iris-setosa,female,0
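
For a simple two-value encoding like this, the user-defined function can be replaced by built-in operations. The sketch below uses a made-up four-row `sex` column and shows two alternatives that produce the same 0/1 values as `f`.

```python
import pandas as pd

sample = pd.DataFrame({"sex": ["female", "male", "male", "female"]})

# Option 1: dictionary lookup; values missing from the dict would become NaN.
sample["gender_map"] = sample["sex"].map({"male": 1, "female": 0})

# Option 2: vectorised comparison, then cast the booleans to integers.
sample["gender_cmp"] = (sample["sex"] == "male").astype(int)

print(sample["gender_map"].tolist())  # [0, 1, 1, 0]
print(sample["gender_cmp"].tolist())  # [0, 1, 1, 0]
```

Option 2 matches `f` exactly (anything other than 'male' becomes 0), while option 1 makes unexpected values stand out as NaN.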


In [120]:
# now we can delete the sex column because we don't need it anymore

del df['sex']
df.head(15)

Unnamed: 0,sl,sw,pl,pw,flower,gender
0,4.9,3.0,1.4,0.2,Iris-setosa,0
1,4.7,3.2,1.3,0.2,Iris-setosa,1
2,4.6,3.1,1.5,0.2,Iris-setosa,1
3,5.0,3.6,1.4,0.2,Iris-setosa,1
4,5.4,3.9,1.7,0.4,Iris-setosa,1
5,4.6,3.4,1.4,0.3,Iris-setosa,1
6,5.0,3.4,1.5,0.2,Iris-setosa,1
7,4.4,2.9,1.4,0.2,Iris-setosa,0
8,4.9,3.1,1.5,0.1,Iris-setosa,0
9,5.4,3.7,1.5,0.2,Iris-setosa,0


#### When reading data, we only keep the columns (features) that are useful for the analysis
#### For example, suppose a dataset about a bus accident records whether each person survived and also has a column of usernames. The username has no relation to that person's survival, so we drop (delete) that column.

### Golden Rules:

1. Understand the meaning of each column.
2. Identify columns that can be deleted.
3. Replace columns of string values (if any) with integer values for analysis.
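
The three rules can be sketched on the bus-accident example. Everything in the frame below is hypothetical: the column names (`username`, `sex`, `age`, `survived`) and the three records are invented purely to illustrate the steps.

```python
import pandas as pd

# Hypothetical records; names and values are made up for illustration.
passengers = pd.DataFrame({
    "username": ["ann01", "bob22", "cat_7"],
    "sex": ["female", "male", "female"],
    "age": [34, 51, 27],
    "survived": ["yes", "no", "yes"],
})

# Rule 1: inspect the columns and their types to understand what each one holds.
print(passengers.dtypes)

# Rule 2: drop a column that has no bearing on the outcome.
passengers = passengers.drop(columns=["username"])

# Rule 3: replace string values with integers.
passengers["sex"] = passengers["sex"].map({"male": 1, "female": 0})
passengers["survived"] = passengers["survived"].map({"yes": 1, "no": 0})

print(passengers)
```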