# Introduction to pandas:
[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API.
It's a great tool for handling and analyzing input data, and many ML frameworks
support *pandas* data structures as inputs.  

Although a comprehensive
introduction to the API would span many pages, the core concepts are fairly
straightforward, and we'll present them below. For a more complete reference,
the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html)
contains extensive documentation and many tutorials. (Note that Colab may use a
slightly older version number, but the parts of *pandas* covered here are
unlikely to differ from version to version.)

In [1]:
import pandas as pd
pd.__version__

'2.2.2'

# Series and DataFrame:
The primary data structures in *pandas* are implemented as two classes:
* **`Series`**, which is a single column. Each row can be labeled via an index.
* **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns. A `DataFrame` contains one or more `Series` and a name for each `Series`.

The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in Spark and R.

### Series:
Think of a `Series` as:
* A single column of data
* Like a list, but with superpowers:
    * It has values
    * It has labels (called an index)

In [2]:
cities = pd.Series(["chennai", "mumbai", "kolkata", "delhi"])
cities


Unnamed: 0,0
0,chennai
1,mumbai
2,kolkata
3,delhi


In [3]:
type(cities)

We can also label the rows ourselves:

In [4]:
cities = pd.Series({"south":"Chennai", "west":"Mumbai", "east":"Kolkata", "north":"New Delhi"})

cities

Unnamed: 0,0
south,Chennai
west,Mumbai
east,Kolkata
north,New Delhi
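
To make the "values plus labels" idea concrete, here is a minimal sketch that rebuilds the labeled `Series` from above and reads it back by label and by position. The variable names `labels`, `values`, `west_city` and `first_city` are illustrative only.

```python
import pandas as pd

cities = pd.Series({"south": "Chennai", "west": "Mumbai", "east": "Kolkata", "north": "New Delhi"})

labels = list(cities.index)    # the index holds the labels
values = list(cities.values)   # the values are stored separately
west_city = cities["west"]     # look up by label
first_city = cities.iloc[0]    # look up by position

print(labels)
print(west_city, first_city)
```

The dictionary keys become the index, so the same element can be reached either through its label or through its integer position.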


### DataFrame:
* A `DataFrame` is a stack of `Series` placed side by side, each with a column name.
* Its rows have indices.
* Its columns have names, and each column is a `Series` under the hood.



In [5]:
cities = pd.Series(["chennai", "mumbai", "kolkata", "delhi"])
population = pd.Series([700000, 1700000, 800000])

city_info_df = pd.DataFrame({"cities": cities, "population": population})
city_info_df

Unnamed: 0,cities,population
0,chennai,700000.0
1,mumbai,1700000.0
2,kolkata,800000.0
3,delhi,
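
The empty `population` cell for `delhi` appears because the two `Series` have different lengths and pandas aligns them on their index. The sketch below repeats the same construction and inspects the result; `missing` and `population_dtype` are names introduced here for illustration.

```python
import pandas as pd

cities = pd.Series(["chennai", "mumbai", "kolkata", "delhi"])
population = pd.Series([700000, 1700000, 800000])  # one value short

city_info_df = pd.DataFrame({"cities": cities, "population": population})

# Index label 3 has no population, so pandas fills it with NaN.
missing = city_info_df["population"].isna()

# NaN is a float, so the integer column is stored as float64.
population_dtype = str(city_info_df["population"].dtype)

print(missing.tolist())
print(population_dtype)
```

This also explains why the populations are displayed as `700000.0` rather than `700000`.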


In [6]:
type(city_info_df)

# Exploring data in DataFrame:

In [7]:
from sklearn.datasets import load_diabetes


?load_diabetes

In [8]:
diabetes = load_diabetes(as_frame=True)
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

:Number of Instances: 442

:Number of Attributes: First 10 columns are numeric predictive values

:Target: Column 11 is a quantitative measure of disease progression one year after baseline

:Attribute Information:
    - age     age in years
    - sex
    - bmi     body mass index
    - bp      average blood pressure
    - s1      tc, total serum cholesterol
    - s2      ldl, low-density lipoproteins
    - s3      hdl, high-density lipoproteins
    - s4      tch, total cholesterol / HDL
    - s5      ltg, possibly log of serum triglycerides level
    - s6      glu, blood sugar level

Note: Each of these 10 feature variables have bee

* `load_diabetes` returns a **Bunch** object (a dictionary whose keys are also available as attributes), which contains fields like these:
```python
dict_keys([
    'data',            # feature data of shape (442, 10): a pandas DataFrame when as_frame=True, otherwise a numpy array
    'target',          # continuous target values of shape (442,): a pandas Series when as_frame=True, otherwise a numpy array
    'frame',           # pandas DataFrame holding data and target together (only when as_frame=True)
    'feature_names',   # list of 10 feature names
    'DESCR',           # long string description of the dataset
    'data_filename',   # path to .csv file with data (for legacy purposes)
    'target_filename'  # path to .csv file with target values (for legacy purposes)
])
```

In [9]:
type(diabetes)

In [10]:
diabetes.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [11]:
diabetes.data[:5]

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641


In [12]:
df = diabetes["data"]
df.shape

(442, 10)

The `DataFrame` has 442 rows and 10 columns.

In [13]:
df.columns

Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')

`df.columns` returns a pandas `Index`, which behaves much like a list but is not a built-in Python `list`. It can be converted to one with `list()`:

In [14]:
list(df.columns)

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
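
A small sketch of how an `Index` differs from a plain list, using a hypothetical two-column frame called `toy` rather than the diabetes data:

```python
import pandas as pd

toy = pd.DataFrame({"age": [0.1, 0.2], "sex": [1, 2]})
columns = toy.columns

has_age = "age" in columns     # membership tests work as with a list
as_list = columns.tolist()     # explicit conversion to a Python list

# Unlike a list, an Index cannot be modified item by item.
try:
    columns[0] = "years"
    is_immutable = False
except TypeError:
    is_immutable = True

print(has_age, as_list, is_immutable)
```

To rename a column, assign a whole new list to `toy.columns` or use `toy.rename(columns=...)` instead of mutating the `Index` in place.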

In [15]:
df.info()  # info is a method; call it to print column dtypes and non-null counts

In [16]:
# first five entries
df.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641


In [17]:
# Last ten entries
df.tail(10)

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
432,0.009016,-0.044642,0.055229,-0.00567,0.057597,0.044719,-0.002903,0.023239,0.055686,0.106617
433,-0.02731,-0.044642,-0.060097,-0.02977,0.046589,0.01998,0.122273,-0.039493,-0.051404,-0.009362
434,0.016281,-0.044642,0.001339,0.008101,0.005311,0.010899,0.030232,-0.039493,-0.045424,0.032059
435,-0.01278,-0.044642,-0.023451,-0.040099,-0.016704,0.004636,-0.017629,-0.002592,-0.03846,-0.038357
436,-0.05637,-0.044642,-0.074108,-0.050427,-0.02496,-0.047034,0.09282,-0.076395,-0.061176,-0.046641
437,0.041708,0.05068,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
438,-0.005515,0.05068,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485
439,0.041708,0.05068,-0.015906,0.017293,-0.037344,-0.01384,-0.024993,-0.01108,-0.046883,0.015491
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.02656,0.044529,-0.02593
441,-0.045472,-0.044642,-0.07303,-0.081413,0.08374,0.027809,0.173816,-0.039493,-0.004222,0.003064


In [18]:
df.describe()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
count,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0
mean,-2.511817e-19,1.23079e-17,-2.245564e-16,-4.79757e-17,-1.3814990000000001e-17,3.9184340000000004e-17,-5.777179e-18,-9.04254e-18,9.293722000000001e-17,1.130318e-17
std,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905
min,-0.1072256,-0.04464164,-0.0902753,-0.1123988,-0.1267807,-0.1156131,-0.1023071,-0.0763945,-0.1260971,-0.1377672
25%,-0.03729927,-0.04464164,-0.03422907,-0.03665608,-0.03424784,-0.0303584,-0.03511716,-0.03949338,-0.03324559,-0.03317903
50%,0.00538306,-0.04464164,-0.007283766,-0.005670422,-0.004320866,-0.003819065,-0.006584468,-0.002592262,-0.001947171,-0.001077698
75%,0.03807591,0.05068012,0.03124802,0.03564379,0.02835801,0.02984439,0.0293115,0.03430886,0.03243232,0.02791705
max,0.1107267,0.05068012,0.1705552,0.1320436,0.1539137,0.198788,0.1811791,0.1852344,0.1335973,0.1356118


In [19]:
df.describe().T # to transpose

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,442.0,-2.511817e-19,0.047619,-0.107226,-0.037299,0.005383,0.038076,0.110727
sex,442.0,1.23079e-17,0.047619,-0.044642,-0.044642,-0.044642,0.05068,0.05068
bmi,442.0,-2.245564e-16,0.047619,-0.090275,-0.034229,-0.007284,0.031248,0.170555
bp,442.0,-4.79757e-17,0.047619,-0.112399,-0.036656,-0.00567,0.035644,0.132044
s1,442.0,-1.3814990000000001e-17,0.047619,-0.126781,-0.034248,-0.004321,0.028358,0.153914
s2,442.0,3.9184340000000004e-17,0.047619,-0.115613,-0.030358,-0.003819,0.029844,0.198788
s3,442.0,-5.777179e-18,0.047619,-0.102307,-0.035117,-0.006584,0.029312,0.181179
s4,442.0,-9.04254e-18,0.047619,-0.076395,-0.039493,-0.002592,0.034309,0.185234
s5,442.0,9.293722000000001e-17,0.047619,-0.126097,-0.033246,-0.001947,0.032432,0.133597
s6,442.0,1.130318e-17,0.047619,-0.137767,-0.033179,-0.001078,0.027917,0.135612


In [20]:
df.describe(percentiles=[0.2, 0.4, 0.6])

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
count,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0
mean,-2.511817e-19,1.23079e-17,-2.245564e-16,-4.79757e-17,-1.3814990000000001e-17,3.9184340000000004e-17,-5.777179e-18,-9.04254e-18,9.293722000000001e-17,1.130318e-17
std,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905
min,-0.1072256,-0.04464164,-0.0902753,-0.1123988,-0.1267807,-0.1156131,-0.1023071,-0.0763945,-0.1260971,-0.1377672
20%,-0.04547248,-0.04464164,-0.04048038,-0.04009893,-0.03871969,-0.03695017,-0.03971921,-0.03949338,-0.04117617,-0.03835666
40%,-0.005514555,-0.04464164,-0.01806189,-0.01944183,-0.0120262,-0.01559345,-0.01762938,-0.007684617,-0.01811369,-0.01184718
50%,0.00538306,-0.04464164,-0.007283766,-0.005670422,-0.004320866,-0.003819065,-0.006584468,-0.002592262,-0.001947171,-0.001077698
60%,0.01628068,0.05068012,0.005218854,0.008100982,0.00806271,0.008706873,0.008142084,-0.002592262,0.01255119,0.007206516
max,0.1107267,0.05068012,0.1705552,0.1320436,0.1539137,0.198788,0.1811791,0.1852344,0.1335973,0.1356118
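
The rows of `describe()` are ordinary summary statistics, so they can be cross-checked against the individual methods. The following is a minimal sketch on a hypothetical one-column frame `toy`, small enough to verify by hand:

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})

summary = toy.describe(percentiles=[0.2])

mean_from_describe = summary.loc["mean", "x"]
p20_from_describe = summary.loc["20%", "x"]

# The same numbers computed directly.
mean_direct = toy["x"].mean()
p20_direct = toy["x"].quantile(0.2)

print(mean_from_describe, mean_direct)
print(p20_from_describe, p20_direct)
```

Each requested percentile shows up as a row labeled with its percentage, which is why `0.2` is read back as `"20%"`.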


# Selection:

In [21]:
df.columns

Index(['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], dtype='object')

In [22]:
df["age"]

Unnamed: 0,age
0,0.038076
1,-0.001882
2,0.085299
3,-0.089063
4,0.005383
...,...
437,0.041708
438,-0.005515
439,0.041708
440,-0.045472


In [23]:
type(df["age"])

In [24]:
df["age"][:5]

Unnamed: 0,age
0,0.038076
1,-0.001882
2,0.085299
3,-0.089063
4,0.005383


In [25]:
df["age"][100:200]

Unnamed: 0,age
100,0.016281
101,0.016281
102,-0.092695
103,0.059871
104,-0.027310
...,...
195,0.027178
196,-0.023677
197,0.048974
198,-0.052738


In [26]:
df[["age", "sex"]]

Unnamed: 0,age,sex
0,0.038076,0.050680
1,-0.001882,-0.044642
2,0.085299,0.050680
3,-0.089063,-0.044642
4,0.005383,-0.044642
...,...,...
437,0.041708,0.050680
438,-0.005515,0.050680
439,0.041708,0.050680
440,-0.045472,-0.044642


In [27]:
type(df[["age", "sex"]])

In [28]:
df[["age", "sex"]][:5]

Unnamed: 0,age,sex
0,0.038076,0.05068
1,-0.001882,-0.044642
2,0.085299,0.05068
3,-0.089063,-0.044642
4,0.005383,-0.044642


In [29]:
df[["age", "sex"]][10:20]

Unnamed: 0,age,sex
10,-0.096328,-0.044642
11,0.027178,0.05068
12,0.016281,-0.044642
13,0.005383,0.05068
14,0.045341,-0.044642
15,-0.052738,0.05068
16,-0.005515,-0.044642
17,0.070769,0.05068
18,-0.038207,-0.044642
19,-0.02731,-0.044642


`iloc`
* Integer-location-based indexing
* Takes row and column positions (counted from 0)
* Uses integer positions only, never labels


In [30]:
df.iloc[0] # returns first row

Unnamed: 0,0
age,0.038076
sex,0.05068
bmi,0.061696
bp,0.021872
s1,-0.044223
s2,-0.034821
s3,-0.043401
s4,-0.002592
s5,0.019907
s6,-0.017646


In [31]:
df.iloc[2, 4] # 3rd row 5th column

np.float64(-0.04559945128264711)

In [32]:
df.iloc[1:6] # 2nd row to 6th row

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
5,-0.092695,-0.044642,-0.040696,-0.019442,-0.068991,-0.079288,0.041277,-0.076395,-0.041176,-0.096346


In [33]:
df.iloc[:, 1] # all rows, second column (positions start at 0)

Unnamed: 0,sex
0,0.050680
1,-0.044642
2,0.050680
3,-0.044642
4,-0.044642
...,...
437,0.050680
438,0.050680
439,0.050680
440,-0.044642


In [34]:
df.iloc[5, [1, 3, 5, 7]] # 6th row (position 5) and the columns at positions 1, 3, 5, 7

Unnamed: 0,5
sex,-0.044642
bp,-0.019442
s2,-0.079288
s4,-0.076395


`loc`
* Label-based indexing
* Takes the actual labels of the rows and the names of the columns



In [35]:
df.loc[0] # row with label 0 (the first row)

Unnamed: 0,0
age,0.038076
sex,0.05068
bmi,0.061696
bp,0.021872
s1,-0.044223
s2,-0.034821
s3,-0.043401
s4,-0.002592
s5,0.019907
s6,-0.017646


In [36]:
df.loc[5, "age"] # row labeled 5 (the 6th row), column named age

np.float64(-0.09269547780327612)

In [37]:
df.loc[6:10,"sex"] # rows labeled 6 through 10 (inclusive), sex column

Unnamed: 0,sex
6,0.05068
7,0.05068
8,0.05068
9,-0.044642
10,-0.044642


Unlike `iloc`, `loc` slicing is inclusive on both ends.
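
The difference between the two slicing rules is easiest to see side by side. This sketch uses a hypothetical frame `toy` with the default labels 0 to 9, so positions and labels coincide:

```python
import pandas as pd

toy = pd.DataFrame({"v": range(10, 20)})  # default labels 0..9

by_position = toy.iloc[2:5]  # positions 2, 3, 4 (end excluded)
by_label = toy.loc[2:5]      # labels 2, 3, 4, 5 (end included)

print(len(by_position), len(by_label))
```

The same slice `2:5` therefore yields three rows with `iloc` and four rows with `loc`.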

In [38]:
df.loc[5,["age", "sex"]] # row labeled 5, age and sex columns

Unnamed: 0,5
age,-0.092695
sex,-0.044642


In [39]:
row_condition_met = df.age > 5.383060e-03

In [40]:
df[row_condition_met]

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
7,0.063504,0.050680,-0.001895,0.066629,0.090620,0.108914,0.022869,0.017703,-0.035816,0.003064
8,0.041708,0.050680,0.061696,-0.040099,-0.013953,0.006202,-0.028674,-0.002592,-0.014960,0.011349
...,...,...,...,...,...,...,...,...,...,...
431,0.070769,0.050680,-0.030996,0.021872,-0.037344,-0.047034,0.033914,-0.039493,-0.014960,-0.001078
432,0.009016,-0.044642,0.055229,-0.005670,0.057597,0.044719,-0.002903,0.023239,0.055686,0.106617
434,0.016281,-0.044642,0.001339,0.008101,0.005311,0.010899,0.030232,-0.039493,-0.045424,0.032059
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207


In [41]:
df.loc[df["age"] > 5.383060e-03]

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
7,0.063504,0.050680,-0.001895,0.066629,0.090620,0.108914,0.022869,0.017703,-0.035816,0.003064
8,0.041708,0.050680,0.061696,-0.040099,-0.013953,0.006202,-0.028674,-0.002592,-0.014960,0.011349
...,...,...,...,...,...,...,...,...,...,...
431,0.070769,0.050680,-0.030996,0.021872,-0.037344,-0.047034,0.033914,-0.039493,-0.014960,-0.001078
432,0.009016,-0.044642,0.055229,-0.005670,0.057597,0.044719,-0.002903,0.023239,0.055686,0.106617
434,0.016281,-0.044642,0.001339,0.008101,0.005311,0.010899,0.030232,-0.039493,-0.045424,0.032059
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207


In [42]:
age_df_temp = df.loc[df["age"] < 5.383060e-03]
age_df_temp

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
5,-0.092695,-0.044642,-0.040696,-0.019442,-0.068991,-0.079288,0.041277,-0.076395,-0.041176,-0.096346
6,-0.045472,0.050680,-0.047163,-0.015999,-0.040096,-0.024800,0.000779,-0.039493,-0.062917,-0.038357
9,-0.070900,-0.044642,0.039062,-0.033213,-0.012577,-0.034508,-0.024993,-0.002592,0.067737,-0.013504
...,...,...,...,...,...,...,...,...,...,...
435,-0.012780,-0.044642,-0.023451,-0.040099,-0.016704,0.004636,-0.017629,-0.002592,-0.038460,-0.038357
436,-0.056370,-0.044642,-0.074108,-0.050427,-0.024960,-0.047034,0.092820,-0.076395,-0.061176,-0.046641
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930


In [43]:
age_df_temp = df[df["age"] < 5.383060e-03]
age_df_temp

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
5,-0.092695,-0.044642,-0.040696,-0.019442,-0.068991,-0.079288,0.041277,-0.076395,-0.041176,-0.096346
6,-0.045472,0.050680,-0.047163,-0.015999,-0.040096,-0.024800,0.000779,-0.039493,-0.062917,-0.038357
9,-0.070900,-0.044642,0.039062,-0.033213,-0.012577,-0.034508,-0.024993,-0.002592,0.067737,-0.013504
...,...,...,...,...,...,...,...,...,...,...
435,-0.012780,-0.044642,-0.023451,-0.040099,-0.016704,0.004636,-0.017629,-0.002592,-0.038460,-0.038357
436,-0.056370,-0.044642,-0.074108,-0.050427,-0.024960,-0.047034,0.092820,-0.076395,-0.061176,-0.046641
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930


In [44]:
age_df_temp.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
5,-0.092695,-0.044642,-0.040696,-0.019442,-0.068991,-0.079288,0.041277,-0.076395,-0.041176,-0.096346
6,-0.045472,0.05068,-0.047163,-0.015999,-0.040096,-0.0248,0.000779,-0.039493,-0.062917,-0.038357
9,-0.0709,-0.044642,0.039062,-0.033213,-0.012577,-0.034508,-0.024993,-0.002592,0.067737,-0.013504


In [45]:
age_df_temp.iloc[1] # 2nd row

Unnamed: 0,3
age,-0.089063
sex,-0.044642
bmi,-0.011595
bp,-0.036656
s1,0.012191
s2,0.024991
s3,-0.036038
s4,0.034309
s5,0.022688
s6,-0.009362


In [46]:
age_df_temp.loc[1] # row with label 1



Unnamed: 0,1
age,-0.001882
sex,-0.044642
bmi,-0.051474
bp,-0.026328
s1,-0.008449
s2,-0.019163
s3,0.074412
s4,-0.039493
s5,-0.068332
s6,-0.092204


`age_df_temp.loc[0]` raises a `KeyError`, since no row with label 0 remains in `age_df_temp`.
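
A filtered frame keeps the labels of the rows that survived, which is why positions and labels drift apart. The sketch below reproduces this on a hypothetical four-row frame `toy`; `reset_index(drop=True)` is shown as one possible way to renumber the rows and is not used elsewhere in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"age": [0.04, -0.01, 0.09, -0.09]})
negative = toy[toy["age"] < 0]  # keeps the rows labeled 1 and 3

second_by_position = negative.iloc[1]["age"]  # position 1 is the row labeled 3

# Label 0 was filtered out, so a label lookup fails.
try:
    negative.loc[0]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Optional: renumber the rows 0..n-1, discarding the old labels.
renumbered = negative.reset_index(drop=True)

print(list(negative.index), raised_key_error, list(renumbered.index))
```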

In [47]:
age_df_temp = df.loc[(df.age < 5.383060e-03) & (df.sex > -4.464164e-02)]  # & keeps rows where both conditions hold
age_df_temp.shape

(214, 10)

In [48]:
age_df_temp

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
5,-0.092695,-0.044642,-0.040696,-0.019442,-0.068991,-0.079288,0.041277,-0.076395,-0.041176,-0.096346
6,-0.045472,0.050680,-0.047163,-0.015999,-0.040096,-0.024800,0.000779,-0.039493,-0.062917,-0.038357
9,-0.070900,-0.044642,0.039062,-0.033213,-0.012577,-0.034508,-0.024993,-0.002592,0.067737,-0.013504
...,...,...,...,...,...,...,...,...,...,...
435,-0.012780,-0.044642,-0.023451,-0.040099,-0.016704,0.004636,-0.017629,-0.002592,-0.038460,-0.038357
436,-0.056370,-0.044642,-0.074108,-0.050427,-0.024960,-0.047034,0.092820,-0.076395,-0.061176,-0.046641
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930


In [49]:
list("ABCD")

['A', 'B', 'C', 'D']

In [50]:
list("abcdefghi")

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

In [51]:
range(100)

range(0, 100)

In [52]:
import numpy as np

In [53]:
another_df = pd.DataFrame(
    np.random.rand(100, 4),
    index=range(10, 110),
    columns=list("ABCD"))

* `np.random.rand(100, 4)` generates a $100 \times 4$ NumPy array of random floats drawn uniformly from $[0, 1)$.
* `range(10, 110)` sets the row indices from 10 to 109.
* `list("ABCD")` sets the column names to $A, B, C, D$.
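
Because the values are random, the numbers in the outputs here will differ from run to run. If repeatable values are wanted, one option (an assumption on our part, not what this notebook does) is NumPy's seeded generator, sketched here with an arbitrary seed of 0:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator; the seed value is arbitrary

frame = pd.DataFrame(
    rng.random((100, 4)),       # 100 x 4 floats in [0, 1)
    index=range(10, 110),       # row labels 10..109
    columns=list("ABCD"))

first_label = frame.index[0]
last_label = frame.index[-1]

print(frame.shape, first_label, last_label)
```

Rerunning this cell reproduces the same table, which makes tutorial outputs easier to compare.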

In [54]:
another_df.head()

Unnamed: 0,A,B,C,D
10,0.986698,0.73176,0.035151,0.759674
11,0.804287,0.190376,0.005665,0.878452
12,0.64326,0.354524,0.895132,0.339561
13,0.390085,0.588743,0.975739,0.053271
14,0.667213,0.403447,0.730176,0.90016


In [55]:
another_df.tail()

Unnamed: 0,A,B,C,D
105,0.329867,0.357207,0.346603,0.983008
106,0.177136,0.196821,0.395253,0.743686
107,0.096016,0.470611,0.122461,0.790591
108,0.383758,0.720125,0.812502,0.216686
109,0.501403,0.67235,0.605314,0.846392


In [56]:
df = pd.DataFrame(
    np.random.rand(9, 4),
    index=list("abcdefghi"),
    columns=list("ABCD"))
# df now refers to a new 9 x 4 DataFrame of random values, with row labels a-i and column names A-D

df.shape

(9, 4)

In [57]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [58]:
df.info()  # info is a method; call it to print the summary

In [59]:
df.head()

Unnamed: 0,A,B,C,D
a,0.146897,0.700869,0.324619,0.547216
b,0.534709,0.836482,0.107944,0.510428
c,0.989438,0.842522,0.856823,0.532569
d,0.314379,0.734212,0.670485,0.484577
e,0.732703,0.641724,0.449436,0.602133


In [60]:
df.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'], dtype='object')

In [61]:
df.loc["a"]

Unnamed: 0,a
A,0.146897
B,0.700869
C,0.324619
D,0.547216


In [62]:
df.loc["a", "A":"D"] # remember loc is inclusive on both ends

Unnamed: 0,a
A,0.146897
B,0.700869
C,0.324619
D,0.547216


In [63]:
df.loc["a", ["A", "C"]]

Unnamed: 0,a
A,0.146897
C,0.324619


In [64]:
df.loc["a":"d", :]

Unnamed: 0,A,B,C,D
a,0.146897,0.700869,0.324619,0.547216
b,0.534709,0.836482,0.107944,0.510428
c,0.989438,0.842522,0.856823,0.532569
d,0.314379,0.734212,0.670485,0.484577


In [65]:
df.iloc[8]

Unnamed: 0,i
A,0.470513
B,0.449648
C,0.2753
D,0.724618


In [66]:
df.iloc[:4, 0:3] # iloc slicing is exclusive on right side

Unnamed: 0,A,B,C
a,0.146897,0.700869,0.324619
b,0.534709,0.836482,0.107944
c,0.989438,0.842522,0.856823
d,0.314379,0.734212,0.670485


In [67]:
df.iloc[:4, 1:3]

Unnamed: 0,B,C
a,0.700869,0.324619
b,0.836482,0.107944
c,0.842522,0.856823
d,0.734212,0.670485


In [68]:
df.iloc[:4]

Unnamed: 0,A,B,C,D
a,0.146897,0.700869,0.324619,0.547216
b,0.534709,0.836482,0.107944,0.510428
c,0.989438,0.842522,0.856823,0.532569
d,0.314379,0.734212,0.670485,0.484577


In [69]:
df.iloc[:, 1]

Unnamed: 0,B
a,0.700869
b,0.836482
c,0.842522
d,0.734212
e,0.641724
f,0.190896
g,0.910119
h,0.492428
i,0.449648


In [70]:
selector = lambda df: df["A"] > 0.5
selector

<function __main__.<lambda>(df)>

In [71]:
df.loc[selector]

Unnamed: 0,A,B,C,D
b,0.534709,0.836482,0.107944,0.510428
c,0.989438,0.842522,0.856823,0.532569
e,0.732703,0.641724,0.449436,0.602133
g,0.54819,0.910119,0.70561,0.108185


In [72]:
df.loc[df["A"]>0.5]

Unnamed: 0,A,B,C,D
b,0.534709,0.836482,0.107944,0.510428
c,0.989438,0.842522,0.856823,0.532569
e,0.732703,0.641724,0.449436,0.602133
g,0.54819,0.910119,0.70561,0.108185


In [73]:
selector = lambda df: (df["A"]>0.5) & (df["B"] < 0.2) # True only where both conditions hold; & is the element-wise counterpart of Python's "and"

df.loc[selector]

Unnamed: 0,A,B,C,D


In [74]:
df.loc[(df["A"]>0.5) & (df["B"]<0.2)]

Unnamed: 0,A,B,C,D


`df.loc[df["A"] > 0.2]` returns the matching rows as a `DataFrame`, while `df["A"] > 0.2` on its own returns a boolean `Series` with one `True`/`False` value per row.
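
The two steps can be separated to see what each one produces. This is a minimal sketch on a hypothetical three-row frame `toy` with fixed values, so the result is predictable:

```python
import pandas as pd

toy = pd.DataFrame(
    {"A": [0.1, 0.6, 0.9], "B": [0.5, 0.1, 0.7]},
    index=list("abc"))

mask = toy["A"] > 0.5   # boolean Series, one entry per row
rows = toy.loc[mask]    # DataFrame containing only the rows where mask is True

print(mask.tolist())
print(list(rows.index))
```

Storing the mask in a variable first makes it easy to inspect or reuse before selecting any rows.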

In [75]:
condition_for_selection = (df["A"] > 0.5) & (df["B"] < 0.2)
condition_for_selection

Unnamed: 0,0
a,False
b,False
c,False
d,False
e,False
f,False
g,False
h,False
i,False


In [76]:
condition_for_selection = (df["A"] > 0.5) | ~(df["B"] < 0.8)
# | is the element-wise counterpart of Python's "or"
# ~ is the element-wise counterpart of Python's "not"

condition_for_selection

Unnamed: 0,0
a,False
b,True
c,True
d,False
e,True
f,False
g,True
h,False
i,False


In [77]:
df.loc[condition_for_selection]


Unnamed: 0,A,B,C,D
b,0.534709,0.836482,0.107944,0.510428
c,0.989438,0.842522,0.856823,0.532569
e,0.732703,0.641724,0.449436,0.602133
g,0.54819,0.910119,0.70561,0.108185
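
The operators `&`, `|` and `~` are required because Python's own `and`, `or` and `not` cannot work element by element. The sketch below, on a hypothetical frame `toy`, shows the three operators and what happens if `and` is used instead:

```python
import pandas as pd

toy = pd.DataFrame({"A": [0.1, 0.6, 0.9], "B": [0.5, 0.1, 0.7]})

both = (toy["A"] > 0.5) & (toy["B"] < 0.2)    # element-wise "and"
either = (toy["A"] > 0.5) | (toy["B"] < 0.2)  # element-wise "or"
not_small_b = ~(toy["B"] < 0.2)               # element-wise "not"

# Plain "and" asks for a single truth value of a whole Series, which is ambiguous.
try:
    (toy["A"] > 0.5) and (toy["B"] < 0.2)
    and_works = True
except ValueError:
    and_works = False

print(both.tolist(), either.tolist(), not_small_b.tolist(), and_works)
```

The parentheses around each comparison matter as well, since `&` and `|` bind more tightly than `>` and `<`.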


# Adding a column to the DataFrame

In [78]:
df["E"] = df["A"] * 100
df

Unnamed: 0,A,B,C,D,E
a,0.146897,0.700869,0.324619,0.547216,14.689659
b,0.534709,0.836482,0.107944,0.510428,53.470866
c,0.989438,0.842522,0.856823,0.532569,98.943789
d,0.314379,0.734212,0.670485,0.484577,31.437914
e,0.732703,0.641724,0.449436,0.602133,73.270305
f,0.343363,0.190896,0.517353,0.7946,34.336268
g,0.54819,0.910119,0.70561,0.108185,54.819038
h,0.071705,0.492428,0.965765,0.685038,7.170476
i,0.470513,0.449648,0.2753,0.724618,47.051292


In [79]:
df["F"] = df["A"] + df["C"]
df

Unnamed: 0,A,B,C,D,E,F
a,0.146897,0.700869,0.324619,0.547216,14.689659,0.471515
b,0.534709,0.836482,0.107944,0.510428,53.470866,0.642652
c,0.989438,0.842522,0.856823,0.532569,98.943789,1.846261
d,0.314379,0.734212,0.670485,0.484577,31.437914,0.984864
e,0.732703,0.641724,0.449436,0.602133,73.270305,1.182139
f,0.343363,0.190896,0.517353,0.7946,34.336268,0.860715
g,0.54819,0.910119,0.70561,0.108185,54.819038,1.2538
h,0.071705,0.492428,0.965765,0.685038,7.170476,1.03747
i,0.470513,0.449648,0.2753,0.724618,47.051292,0.745813


In [82]:
criteria = df["A"] < 0.2
criteria

Unnamed: 0,A
a,True
b,False
c,False
d,False
e,False
f,False
g,False
h,True
i,False


In [85]:
df.loc[criteria, "A"] = 0 # selects column A in the rows where criteria is True and sets those values to 0
df

Unnamed: 0,A,B,C,D,E,F
a,0.0,0.700869,0.324619,0.547216,14.689659,0.471515
b,0.534709,0.836482,0.107944,0.510428,53.470866,0.642652
c,0.989438,0.842522,0.856823,0.532569,98.943789,1.846261
d,0.314379,0.734212,0.670485,0.484577,31.437914,0.984864
e,0.732703,0.641724,0.449436,0.602133,73.270305,1.182139
f,0.343363,0.190896,0.517353,0.7946,34.336268,0.860715
g,0.54819,0.910119,0.70561,0.108185,54.819038,1.2538
h,0.0,0.492428,0.965765,0.685038,7.170476,1.03747
i,0.470513,0.449648,0.2753,0.724618,47.051292,0.745813
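
Note that column `E` still holds the old values for rows `a` and `h`: a derived column is computed once, at assignment time, and is not updated when its source changes. A minimal sketch of the same behavior on a hypothetical two-row frame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({"A": [0.25, 0.5]}, index=["a", "b"])
toy["E"] = toy["A"] * 100            # computed from A as it is right now

toy.loc[toy["A"] < 0.3, "A"] = 0     # overwrite A where the condition holds

a_values = toy["A"].tolist()
e_values = toy["E"].tolist()         # unchanged by the assignment above

print(a_values, e_values)
```

If `E` should reflect the new values, it has to be recomputed with `toy["E"] = toy["A"] * 100`.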


In [91]:
cities = ["Chennai", "Mumbai", "Delhi", "Kolkata", "Bengalure", "Hyderabad", "Pune", "Trichy", "Vizag"]
df["city"] = cities
df

Unnamed: 0,A,B,C,D,E,F,city
a,0.0,0.700869,0.324619,0.547216,14.689659,0.471515,Chennai
b,0.534709,0.836482,0.107944,0.510428,53.470866,0.642652,Mumbai
c,0.989438,0.842522,0.856823,0.532569,98.943789,1.846261,Delhi
d,0.314379,0.734212,0.670485,0.484577,31.437914,0.984864,Kolkata
e,0.732703,0.641724,0.449436,0.602133,73.270305,1.182139,Bengalure
f,0.343363,0.190896,0.517353,0.7946,34.336268,0.860715,Hyderabad
g,0.54819,0.910119,0.70561,0.108185,54.819038,1.2538,Pune
h,0.0,0.492428,0.965765,0.685038,7.170476,1.03747,Trichy
i,0.470513,0.449648,0.2753,0.724618,47.051292,0.745813,Vizag


In [92]:
df_copy = df.copy()
df_copy


Unnamed: 0,A,B,C,D,E,F,city
a,0.0,0.700869,0.324619,0.547216,14.689659,0.471515,Chennai
b,0.534709,0.836482,0.107944,0.510428,53.470866,0.642652,Mumbai
c,0.989438,0.842522,0.856823,0.532569,98.943789,1.846261,Delhi
d,0.314379,0.734212,0.670485,0.484577,31.437914,0.984864,Kolkata
e,0.732703,0.641724,0.449436,0.602133,73.270305,1.182139,Bengalure
f,0.343363,0.190896,0.517353,0.7946,34.336268,0.860715,Hyderabad
g,0.54819,0.910119,0.70561,0.108185,54.819038,1.2538,Pune
h,0.0,0.492428,0.965765,0.685038,7.170476,1.03747,Trichy
i,0.470513,0.449648,0.2753,0.724618,47.051292,0.745813,Vizag


In [93]:
new_cities = ["Chennai", "Mumbai", "Delhi", "Kolkata", "Bengalure", "Hyderabad", "Coimbatore", "Trichy", "Vizag"]
df_copy["new_city"] = new_cities
df_copy

Unnamed: 0,A,B,C,D,E,F,city,new_city
a,0.0,0.700869,0.324619,0.547216,14.689659,0.471515,Chennai,Chennai
b,0.534709,0.836482,0.107944,0.510428,53.470866,0.642652,Mumbai,Mumbai
c,0.989438,0.842522,0.856823,0.532569,98.943789,1.846261,Delhi,Delhi
d,0.314379,0.734212,0.670485,0.484577,31.437914,0.984864,Kolkata,Kolkata
e,0.732703,0.641724,0.449436,0.602133,73.270305,1.182139,Bengalure,Bengalure
f,0.343363,0.190896,0.517353,0.7946,34.336268,0.860715,Hyderabad,Hyderabad
g,0.54819,0.910119,0.70561,0.108185,54.819038,1.2538,Pune,Coimbatore
h,0.0,0.492428,0.965765,0.685038,7.170476,1.03747,Trichy,Trichy
i,0.470513,0.449648,0.2753,0.724618,47.051292,0.745813,Vizag,Vizag


In [95]:
criteria = df_copy["city"].isin(["Pune", "Hyderabad", "Bengaluru"]) # returns boolean values

In [96]:
df_copy.loc[df_copy.city == "Bengalure",["city", "new_city"]] = "Bengaluru"
df_copy

Unnamed: 0,A,B,C,D,E,F,city,new_city
a,0.0,0.700869,0.324619,0.547216,14.689659,0.471515,Chennai,Chennai
b,0.534709,0.836482,0.107944,0.510428,53.470866,0.642652,Mumbai,Mumbai
c,0.989438,0.842522,0.856823,0.532569,98.943789,1.846261,Delhi,Delhi
d,0.314379,0.734212,0.670485,0.484577,31.437914,0.984864,Kolkata,Kolkata
e,0.732703,0.641724,0.449436,0.602133,73.270305,1.182139,Bengaluru,Bengaluru
f,0.343363,0.190896,0.517353,0.7946,34.336268,0.860715,Hyderabad,Hyderabad
g,0.54819,0.910119,0.70561,0.108185,54.819038,1.2538,Pune,Coimbatore
h,0.0,0.492428,0.965765,0.685038,7.170476,1.03747,Trichy,Trichy
i,0.470513,0.449648,0.2753,0.724618,47.051292,0.745813,Vizag,Vizag
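
The order of the correction and the `isin` check matters, because `isin` matches exact strings. A small sketch on a hypothetical three-row frame `toy` shows the mask before and after fixing the misspelling:

```python
import pandas as pd

toy = pd.DataFrame({"city": ["Pune", "Delhi", "Bengalure"]})
wanted = ["Pune", "Bengaluru"]

before_fix = toy["city"].isin(wanted)  # the misspelled name does not match

toy.loc[toy["city"] == "Bengalure", "city"] = "Bengaluru"

after_fix = toy["city"].isin(wanted)   # now the corrected name matches

print(before_fix.tolist(), after_fix.tolist())
```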


In [120]:
# let's add a column and then remove it with the drop function
df_copy["drop_column"] = [1, 2, 3, 4, 5, 6, 7, 8, 9]
df_copy

Unnamed: 0,A,B,C,D,E,F,city,new_city,drop_column
a,0.0,0.700869,0.324619,0.547216,14.689659,0.471515,Chennai,Chennai,1
b,0.534709,0.836482,0.107944,0.510428,53.470866,0.642652,Mumbai,Mumbai,2
c,0.989438,0.842522,0.856823,0.532569,98.943789,1.846261,Delhi,Delhi,3
d,0.314379,0.734212,0.670485,0.484577,31.437914,0.984864,Kolkata,Kolkata,4
e,0.732703,0.641724,0.449436,0.602133,73.270305,1.182139,Bengaluru,Bengaluru,5
f,0.343363,0.190896,0.517353,0.7946,34.336268,0.860715,Hyderabad,Hyderabad,6
g,0.54819,0.910119,0.70561,0.108185,54.819038,1.2538,Pune,Coimbatore,7
h,0.0,0.492428,0.965765,0.685038,7.170476,1.03747,Trichy,Trichy,8
i,0.470513,0.449648,0.2753,0.724618,47.051292,0.745813,Vizag,Vizag,9


In [99]:
?df.drop

In [121]:
df_copy = df_copy.drop(["drop_column"], axis = 1) # axis = 1 to drop columns
df_copy

Unnamed: 0,A,B,C,D,E,F,city,new_city
a,0.0,0.700869,0.324619,0.547216,14.689659,0.471515,Chennai,Chennai
b,0.534709,0.836482,0.107944,0.510428,53.470866,0.642652,Mumbai,Mumbai
c,0.989438,0.842522,0.856823,0.532569,98.943789,1.846261,Delhi,Delhi
d,0.314379,0.734212,0.670485,0.484577,31.437914,0.984864,Kolkata,Kolkata
e,0.732703,0.641724,0.449436,0.602133,73.270305,1.182139,Bengaluru,Bengaluru
f,0.343363,0.190896,0.517353,0.7946,34.336268,0.860715,Hyderabad,Hyderabad
g,0.54819,0.910119,0.70561,0.108185,54.819038,1.2538,Pune,Coimbatore
h,0.0,0.492428,0.965765,0.685038,7.170476,1.03747,Trichy,Trichy
i,0.470513,0.449648,0.2753,0.724618,47.051292,0.745813,Vizag,Vizag


In [103]:
df_copy = df_copy.drop(["city"]) # this throws an error: axis defaults to 0 (rows), and there is no row labeled city

KeyError: "['city'] not found in axis"
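
One way to avoid this mistake is to name the axis through the `columns=` and `index=` keywords of `drop` instead of relying on `axis`. A minimal sketch on a hypothetical frame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2], "city": ["x", "y"]}, index=["a", "b"])

without_city = toy.drop(columns=["city"])  # same as drop(["city"], axis=1)
without_row_a = toy.drop(index=["a"])      # same as drop(["a"], axis=0)

print(list(without_city.columns), list(without_row_a.index))
print(list(toy.columns))  # the original frame is left unchanged
```

`drop` returns a new frame, so the result has to be assigned to a variable to keep it.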

In [104]:
?df_copy.sample

In [122]:
df_copy.sample(3) # returns 3 randomly chosen rows; the selection changes on every run

Unnamed: 0,A,B,C,D,E,F,city,new_city
a,0.0,0.700869,0.324619,0.547216,14.689659,0.471515,Chennai,Chennai
d,0.314379,0.734212,0.670485,0.484577,31.437914,0.984864,Kolkata,Kolkata
e,0.732703,0.641724,0.449436,0.602133,73.270305,1.182139,Bengaluru,Bengaluru


* Setting a `random_state` gives the same random sample every time the script is run.
* It fixes the seed of the random number generator, so the "random" choice is repeatable.
* The seed can be any non-negative integer from 0 to 2<sup>32</sup> - 1; 42 is simply a common convention.
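
The repeatability can be checked directly by sampling twice with the same seed. This sketch uses a hypothetical nine-row frame `toy` rather than `df_copy`:

```python
import pandas as pd

toy = pd.DataFrame({"v": range(9)}, index=list("abcdefghi"))

first = toy.sample(3, random_state=42)
second = toy.sample(3, random_state=42)

same_rows = first.equals(second)  # identical rows in identical order

print(list(first.index), same_rows)
```

Without `random_state`, two consecutive calls would usually return different rows.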

In [112]:
df_copy.sample(3, random_state = 42)

Unnamed: 0,A,B,C,D,E,F,city,new_city
h,0.0,0.492428,0.965765,0.685038,7.170476,1.03747,Trichy,Trichy
b,0.534709,0.836482,0.107944,0.510428,53.470866,0.642652,Mumbai,Mumbai
f,0.343363,0.190896,0.517353,0.7946,34.336268,0.860715,Hyderabad,Hyderabad


In [129]:
df_copy.sample(3, replace=True) # sampling with replacement allows the same row to appear more than once. Try running it several times to see a row repeated

Unnamed: 0,A,B,C,D,E,F,city,new_city
a,0.0,0.700869,0.324619,0.547216,14.689659,0.471515,Chennai,Chennai
d,0.314379,0.734212,0.670485,0.484577,31.437914,0.984864,Kolkata,Kolkata
a,0.0,0.700869,0.324619,0.547216,14.689659,0.471515,Chennai,Chennai
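
Sampling with replacement also makes it possible to draw more rows than the frame contains, which is not allowed otherwise. A minimal sketch on a hypothetical nine-row frame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({"v": range(9)}, index=list("abcdefghi"))

# With replacement, 20 draws from 9 rows are fine; repeats are unavoidable.
oversample = toy.sample(20, replace=True)
has_repeats = not oversample.index.is_unique

# Without replacement, asking for more rows than exist raises an error.
try:
    toy.sample(20)
    too_many_allowed = True
except ValueError:
    too_many_allowed = False

print(len(oversample), has_repeats, too_many_allowed)
```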
