#**Pandas**

**What will you learn?**
1. Introduction to Pandas
2. Reading the Data
3. **Functionalities of Pandas** : Creation, Viewing, Editing
4. Manipulating Data
5. Handling NaN
6. **Handling Duplicates** : Row Index, Column Names
7. Handling String Data

Pandas is an open source library which provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Pandas has a lot of functions that will help in reading and writing data and also for data manipulation. Thus we will be using pandas throughout the course.

A pandas DataFrame behaves much like an Excel spreadsheet: data is arranged in labelled rows and columns.

Let's import pandas and read some data.

In [None]:
#Import Pandas
import pandas as pd

##**Reading Data**

We will use the **read_csv()** function, which reads a comma-separated values (CSV) file into a DataFrame.

In [None]:
#Loading data with the read_csv() function. Here we pass the URL of the csv file.
#If the file is on your own system, you can provide its local path instead.
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")

In [None]:
type(iris)

pandas.core.frame.DataFrame

##**Pandas Dataframes**

A DataFrame is an object for data manipulation. You can think of it as a 2D tabular structure, where every row is a dataset entry and every column represents a feature of the data.

In [None]:
iris

Unnamed: 0,5.1,3.5,1.4,0.2,Iris-setosa
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa
...,...,...,...,...,...
144,6.7,3.0,5.2,2.3,Iris-virginica
145,6.3,2.5,5.0,1.9,Iris-virginica
146,6.5,3.0,5.2,2.0,Iris-virginica
147,6.2,3.4,5.4,2.3,Iris-virginica


By default, the first row of the csv file has been used as column names. We will soon see how to fix that.

###**Creating copy of DataFrame**

In [None]:
df = iris 
## Above statement simply makes df refer to the data frame object that iris is referring to. 
## So now both iris and df refer to the same dataframe object and any changes done via one will reflect in other.
## So effectively this is not creating another dataframe object.     

To create an independent copy, we use the **copy()** function:

In [None]:
df = iris.copy()
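The difference between plain assignment and copy() can be checked on a tiny frame. This is a minimal sketch; the frame and the variable names `original`, `alias` and `independent` are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame used only for this demonstration.
original = pd.DataFrame({"sl": [5.1, 4.9], "sw": [3.5, 3.0]})

alias = original               # a second name for the same object
independent = original.copy()  # a separate object with its own data

alias.loc[0, "sl"] = 0.0         # visible through `original` as well
independent.loc[1, "sw"] = -1.0  # does not touch `original`

print(alias is original)        # same object?
print(original.loc[0, "sl"])    # changed through the alias
print(original.loc[1, "sw"])    # unaffected by the change to the copy
```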

In [None]:
df.shape

(149, 5)

As you can see, we have 149 rows and 5 columns. The Iris dataset, however, contains 150 rows: 3 types of flower with 50 entries each. One row is missing because the first row of the file was used as the column names. To fix this, we do the following:

In [None]:
#Ignoring header -> If you don't want first row to be treated as a header, you can set header = None
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header=None)
iris

Unnamed: 0,0,1,2,3,4
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [None]:
df = iris.copy()
df.shape

(150, 5)
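read_csv() also accepts a `names` argument, so a header-less file can be given column names while it is being read. The sketch below uses a small in-memory CSV instead of the real URL so that it runs offline; the three rows are copied from the table above, and the column names are illustrative abbreviations (sepal length, sepal width, petal length, petal width).

```python
import io
import pandas as pd

# Three rows in the same header-less layout as iris.data.
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# header=None keeps the first line as data; names supplies the column labels.
sample = pd.read_csv(
    raw, header=None, names=["sl", "sw", "pl", "pw", "flower_type"]
)

print(sample.shape)
print(list(sample.columns))
```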

To see the datatypes of each column we do the following:

In [None]:
df.dtypes

0    float64
1    float64
2    float64
3    float64
4     object
dtype: object

Currently, our columns have no names.

In [None]:
df.columns

Int64Index([0, 1, 2, 3, 4], dtype='int64')

To name them, we assign a list of names to df.columns:

In [None]:
df.columns = ['sl', 'sw', 'pl', 'pw', 'flower_type']
df

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [None]:
df.dtypes

sl             float64
sw             float64
pl             float64
pw             float64
flower_type     object
dtype: object

We may get a quick analysis of our data using **describe()**

In [None]:
df.describe()

Unnamed: 0,sl,sw,pl,pw
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5
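The summary above covers only the numeric columns. Calling describe() on a text column gives a different set of statistics (count, unique, top, freq). A minimal sketch on a made-up three-row frame:

```python
import pandas as pd

# Hypothetical miniature of the iris frame.
demo = pd.DataFrame(
    {
        "sl": [5.1, 4.9, 6.7],
        "flower_type": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
    }
)

# For text data, describe() reports how many values there are, how many are
# distinct, which one is most common, and how often it occurs.
text_summary = demo["flower_type"].describe()
print(text_summary)
```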


##**Some Basic Functionalities**

###**Viewing the DataFrame**

We have the **head()** and **tail()** functions for viewing the dataframe.


####**head()**



This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

By default, value of n = 5.

In [None]:
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [None]:
df.head(10)

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


####**tail()**

This function returns the last n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

By default, value of n = 5.

In [None]:
df.tail()

Unnamed: 0,sl,sw,pl,pw,flower_type
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


In [None]:
df.tail(11)

Unnamed: 0,sl,sw,pl,pw,flower_type
139,6.9,3.1,5.4,2.1,Iris-virginica
140,6.7,3.1,5.6,2.4,Iris-virginica
141,6.9,3.1,5.1,2.3,Iris-virginica
142,5.8,2.7,5.1,1.9,Iris-virginica
143,6.8,3.2,5.9,2.3,Iris-virginica
144,6.7,3.3,5.7,2.5,Iris-virginica
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


###**Accessing Data**

Sometimes, we may want to look at a single column from the DataFrame. This can be done simply as:

In [None]:
## Viewing sl column
df.sl

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sl, Length: 150, dtype: float64

The same column can also be accessed with bracket notation:

In [None]:
df['sl']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sl, Length: 150, dtype: float64
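The two forms are not always interchangeable. Attribute access fails or misleads when a column name contains a space or clashes with an existing DataFrame method, while bracket access always works. A small sketch with made-up column names:

```python
import pandas as pd

# Hypothetical frame: "count" clashes with the DataFrame.count method,
# and "petal length" is not a valid Python attribute name.
demo = pd.DataFrame(
    {"sl": [5.1, 4.9], "count": [1, 2], "petal length": [1.4, 1.4]}
)

as_attribute = demo.count          # the method, not the column
as_key = demo["count"]             # the column, as a Series
spaced = demo["petal length"]      # only reachable with brackets
two_columns = demo[["sl", "count"]]  # a list of names returns a DataFrame

print(type(as_key).__name__, type(two_columns).__name__)
```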

###**Checking for NULL values**

In [None]:
df.isnull()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
145,False,False,False,False,False
146,False,False,False,False,False
147,False,False,False,False,False
148,False,False,False,False,False


In [None]:
# To get a direct overview 
df.isnull().sum()

sl             0
sw             0
pl             0
pw             0
flower_type    0
dtype: int64
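The same boolean table can be collapsed per row to find which entries contain a missing value. A minimal sketch on a made-up frame, assuming we only want to list the affected rows:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in row 1.
demo = pd.DataFrame({"sl": [5.1, 4.9, 4.7], "sw": [3.5, np.nan, 3.2]})

# any(axis=1) is True for every row that has at least one missing value.
rows_with_missing = demo[demo.isnull().any(axis=1)]
print(rows_with_missing)
```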

###**Selection**

####**iloc[]**

We can use the **iloc[ ]** indexer to access values in a DataFrame.

It selects purely by integer position (from 0 to length-1 of the axis), but may also be used with a boolean array.

Allowed inputs are:
1. An integer, e.g. 5.
2. A list or array of integers, e.g. [4, 3, 0].
3. A slice object with ints, e.g. 1:7.
4. A boolean array.

In [None]:
df.iloc[1:4, 2:4]

Unnamed: 0,pl,pw
1,1.4,0.2
2,1.3,0.2
3,1.5,0.2
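Each of the allowed input types can be tried on a small frame. In this sketch the frame is made up, and its index labels are deliberately not 0, 1, 2, ... to show that iloc[] ignores labels and looks only at positions.

```python
import pandas as pd

# Hypothetical frame with non-default index labels.
demo = pd.DataFrame(
    {"sl": [5.1, 4.9, 4.7, 4.6], "sw": [3.5, 3.0, 3.2, 3.1]},
    index=[10, 20, 30, 40],
)

single_row = demo.iloc[1]                       # an integer: second row, as a Series
picked_rows = demo.iloc[[3, 0]]                 # a list of integers, in the order given
block = demo.iloc[1:3, 0:1]                     # slices: rows 1 and 2, column 0 (end excluded)
masked = demo.iloc[[True, False, True, False]]  # a boolean array: rows 0 and 2

print(picked_rows)
print(block)
```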


####**loc[ ]**

This accesses a group of rows and columns by label(s) or a boolean array.

**.loc[ ]** is primarily label based, but may also be used with a boolean array.

Allowed inputs are:
1. A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
2. A list or array of labels, e.g. ['a', 'b', 'c'].
3. A slice object with labels, e.g. 'a':'f'.
4. A boolean array of the same length as the axis being sliced, e.g. [True, False, True].

In [None]:
df1 = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
     index=['cobra', 'viper', 'sidewinder'],
     columns=['max_speed', 'shield'])
df1

Unnamed: 0,max_speed,shield
cobra,1,2
viper,4,5
sidewinder,7,8


In [None]:
df1.loc['viper']

max_speed    4
shield       5
Name: viper, dtype: int64

In [None]:
df1.loc[['viper', 'sidewinder']]

Unnamed: 0,max_speed,shield
viper,4,5
sidewinder,7,8
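One difference from iloc[] is worth checking: a label slice in loc[] includes its end label, whereas a position slice in iloc[] excludes its end position. The sketch below rebuilds the same three-row frame under the hypothetical name `snakes` and also shows a boolean condition used as the row selector.

```python
import pandas as pd

snakes = pd.DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)

by_label = snakes.loc["cobra":"viper"]   # both end labels are included
by_position = snakes.iloc[0:1]           # end position is excluded

# A boolean Series selects rows; the second argument selects a column.
fast = snakes.loc[snakes["shield"] > 4, "max_speed"]

print(len(by_label), len(by_position))
print(fast)
```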


###**DataFrame from Dictionary**

In [None]:
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
          {'a': 100, 'b': 200, 'c': 300, 'd': 400},
          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
df1 = pd.DataFrame(mydict)
df1

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,100,200,300,400
2,1000,2000,3000,4000
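The example above passes a list of dictionaries, one per row. A single dictionary that maps each column name to a list of values produces the same table. A short sketch comparing the two layouts (the variable names are illustrative):

```python
import pandas as pd

# Row-oriented: one dictionary per row.
rows = [{"a": 1, "b": 2}, {"a": 100, "b": 200}]

# Column-oriented: one list per column.
columns = {"a": [1, 100], "b": [2, 200]}

from_rows = pd.DataFrame(rows)
from_columns = pd.DataFrame(columns)

print(from_rows.equals(from_columns))
```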


##**Manipulating data**

###**Deletion of data**

####**drop()**

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

It returns a DataFrame without the removed index or column labels, or None if inplace=True.

In [None]:
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [None]:
a = df.drop(0)
a.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa


To actually change the data in the original dataframe, we use the parameter 'inplace = True'

In [None]:
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [None]:
df.drop(0, inplace = True)
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa


Let's try to do this again

In [None]:
df.drop(0, inplace = True)   #Error Generated
df.head()

KeyError: ignored

The reason for this is, after dropping 0, the indexing did not change automatically. Now, the labels do not begin from 0, but 1.

As we learnt in the definition, drop() removes rows by their labels. To remove rows by their positions, we can look up the label at a given position with df.index and pass that label to drop():

In [None]:
df.drop(df.index[0], inplace = True)
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa


In [None]:
df.drop(df.index[3], inplace = True)   ## Label 5 removed
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa


We may also remove many labels in one go.

In [None]:
df.drop(df.index[[3, 4]], inplace = True)   ## Label 6, 7 removed
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa
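Dropping a label that is no longer present raises a KeyError, as seen earlier. If a missing label should simply be skipped, drop() accepts errors="ignore". A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame; label 0 is removed first.
demo = pd.DataFrame({"sl": [5.1, 4.9, 4.7]})
demo = demo.drop(0)

# Dropping label 0 again fails by default...
try:
    demo.drop(0)
    raised = False
except KeyError:
    raised = True

# ...but is silently skipped with errors="ignore".
unchanged = demo.drop(0, errors="ignore")

print(raised, list(unchanged.index))
```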


In a similar manner, we may remove columns.

In [None]:
df.drop('sl')   ## Error Generated

KeyError: ignored

An error is generated because the drop function is currently looking for a row with label 'sl'. We need to change the axis.

In [None]:
df.drop('sl', axis = 1)
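drop() also accepts a `columns` keyword, which some readers find clearer than remembering that axis 1 means columns. A sketch on a made-up frame showing that both spellings give the same result:

```python
import pandas as pd

# Hypothetical two-column frame.
demo = pd.DataFrame({"sl": [5.1, 4.9], "sw": [3.5, 3.0]})

via_axis = demo.drop("sl", axis=1)
via_keyword = demo.drop(columns="sl")

print(list(via_keyword.columns))
print(via_axis.equals(via_keyword))
```

Neither call modifies `demo`; both return a new DataFrame unless inplace=True is passed.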

###**Conditional Insights**

We may use the concept of boolean indexing in a DataFrame to access a particular subset of the data and draw inferences from it.

In [None]:
df

Unnamed: 0,sl,sw,pl,pw,flower_type
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


Let's try to gain insights into the data corresponding to Iris-virginica.

In [None]:
df[df.flower_type == 'Iris-virginica'].describe()

Unnamed: 0,sl,sw,pl,pw
count,50.0,50.0,50.0,50.0
mean,6.588,2.974,5.552,2.026
std,0.63588,0.322497,0.551895,0.27465
min,4.9,2.2,4.5,1.4
25%,6.225,2.8,5.1,1.8
50%,6.5,3.0,5.55,2.0
75%,6.9,3.175,5.875,2.3
max,7.9,3.8,6.9,2.5
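Several conditions can be combined with `&` (and) or `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch on a made-up frame; the threshold 6.0 is arbitrary:

```python
import pandas as pd

# Hypothetical four-row frame.
demo = pd.DataFrame(
    {
        "sl": [5.1, 6.3, 7.9, 4.9],
        "flower_type": [
            "Iris-setosa",
            "Iris-virginica",
            "Iris-virginica",
            "Iris-virginica",
        ],
    }
)

# Rows that are Iris-virginica AND have sl above 6.0.
mask = (demo["flower_type"] == "Iris-virginica") & (demo["sl"] > 6.0)
long_virginica = demo[mask]

print(long_virginica)
```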


###**Addition of data**

####**loc[ ]**

In [None]:
df.loc[0] = [1, 2, 3, 4, 'Iris-virginica']
df.tail()

Unnamed: 0,sl,sw,pl,pw,flower_type
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica
0,1.0,2.0,3.0,4.0,Iris-virginica


We may directly create new columns also according to our needs.

In [None]:
df["diff_of_sl_sw"] = df['sl'] - df['sw']
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type,diff_of_sl_sw
2,4.7,3.2,1.3,0.2,Iris-setosa,1.5
3,4.6,3.1,1.5,0.2,Iris-setosa,1.5
4,5.0,3.6,1.4,0.2,Iris-setosa,1.4
8,4.4,2.9,1.4,0.2,Iris-setosa,1.5
9,4.9,3.1,1.5,0.1,Iris-setosa,1.8


In [None]:
df.drop('diff_of_sl_sw', axis = 1, inplace = True)

###**Reset Index**

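After removing certain rows, the index labels are no longer consecutive. We can renumber them using the **reset_index()** function, which returns a new DataFrame and leaves df unchanged unless inplace=True is passed.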

In [None]:
df.reset_index()

Unnamed: 0,index,sl,sw,pl,pw,flower_type
0,2,4.7,3.2,1.3,0.2,Iris-setosa
1,3,4.6,3.1,1.5,0.2,Iris-setosa
2,4,5.0,3.6,1.4,0.2,Iris-setosa
3,8,4.4,2.9,1.4,0.2,Iris-setosa
4,9,4.9,3.1,1.5,0.1,Iris-setosa
...,...,...,...,...,...,...
141,146,6.3,2.5,5.0,1.9,Iris-virginica
142,147,6.5,3.0,5.2,2.0,Iris-virginica
143,148,6.2,3.4,5.4,2.3,Iris-virginica
144,149,5.9,3.0,5.1,1.8,Iris-virginica


But this has created an additional column with old indices. To avoid that, we do: 

In [None]:
df.reset_index(drop = True)

Unnamed: 0,sl,sw,pl,pw,flower_type
0,4.7,3.2,1.3,0.2,Iris-setosa
1,4.6,3.1,1.5,0.2,Iris-setosa
2,5.0,3.6,1.4,0.2,Iris-setosa
3,4.4,2.9,1.4,0.2,Iris-setosa
4,4.9,3.1,1.5,0.1,Iris-setosa
...,...,...,...,...,...
141,6.3,2.5,5.0,1.9,Iris-virginica
142,6.5,3.0,5.2,2.0,Iris-virginica
143,6.2,3.4,5.4,2.3,Iris-virginica
144,5.9,3.0,5.1,1.8,Iris-virginica
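Note that the calls above only display a renumbered frame; the original keeps its old labels unless the result is assigned to a name or inplace=True is used. A short sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame whose labels have gaps, as after several drops.
demo = pd.DataFrame({"sl": [4.7, 4.6, 4.4]}, index=[2, 3, 8])

renumbered = demo.reset_index(drop=True)  # new frame with labels 0, 1, 2

print(list(demo.index))        # original labels are still there
print(list(renumbered.index))  # fresh labels on the returned frame
```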


##**Handling NaN**

###**Values considered “missing”**


As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. While NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object. In many cases, however, the Python None will arise and we wish to also consider that “missing” or “not available” or “NA”.

To make detecting missing values easier (and across different array dtypes), pandas provides the **isna()** and **notna()** functions, which are also methods on Series and DataFrame objects.

Because NaN is a float, a column of integers with even one missing value is cast to a floating-point dtype.
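This casting is easy to observe on small Series. The sketch below also shows pandas' nullable integer dtype "Int64" (capital I), which is one way to keep integers together with missing values; whether it suits a given analysis is a separate decision.

```python
import pandas as pd

complete = pd.Series([1, 2, 3])       # no gaps: stays an integer dtype
with_gap = pd.Series([1, 2, None])    # one gap: stored as floats with NaN

# Nullable integer dtype: integers are kept and the gap becomes <NA>.
nullable = pd.Series([1, 2, None], dtype="Int64")

print(complete.dtype, with_gap.dtype, nullable.dtype)
print(with_gap.isna().sum(), nullable.isna().sum())
```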

NaN values can create inaccuracies in our estimations and calculations. There are two ways we can handle NaN:
1. we either remove them, 
2. or we fill them.

Our current data does not have any NaN values, so we will create some.

In [None]:
import numpy as np
df = iris.copy()
df.columns = ['sl', 'sw', 'pl', 'pw', 'flower_type']

In [None]:
df.iloc[2:4, 1:3] = np.nan
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,,,0.2,Iris-setosa
3,4.6,,,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [None]:
df.describe()

Unnamed: 0,sl,sw,pl,pw
count,150.0,148.0,148.0,150.0
mean,5.843333,3.052703,3.790541,1.198667
std,0.828066,0.436349,1.754618,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.4,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


###**Dropping NaN**

**dropna()** : This will remove the row or column entries with NaN values.

In [None]:
df.dropna(inplace = True)  ## Remove NaN inside df only
df.reset_index(drop = True, inplace = True)   ## Reset the indices

In [None]:
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,3.6,1.4,0.2,Iris-setosa
3,5.4,3.9,1.7,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa


As you may observe, we have removed the row with NaN. If we want to remove the column, we shall use 'axis' parameter.
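The row and column variants can be compared side by side. This sketch uses a made-up frame with a single NaN and also shows the `subset` parameter, which restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN in column "sw".
demo = pd.DataFrame(
    {"sl": [5.1, 4.9, 4.7], "sw": [3.5, np.nan, 3.2], "pl": [1.4, 1.4, 1.3]}
)

rows_dropped = demo.dropna()                # removes row 1
cols_dropped = demo.dropna(axis=1)          # removes column "sw"
checked_sl = demo.dropna(subset=["sl"])     # only looks at "sl": nothing removed

print(rows_dropped.shape, cols_dropped.shape, checked_sl.shape)
```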

###**Filling NaN**

**fillna()** : This replaces NaN values with a value you provide. Besides a single value, you can pass a dict or Series that is alignable: the labels of the dict or the index of the Series must match the columns of the frame you wish to fill.

Generally we fill the NaN values with the mean, but depending on the type of data and your own analysis, you may decide to fill NaN in some other way.

In [None]:
df.iloc[2:4, 1:3] = np.nan
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,,,0.2,Iris-setosa
3,5.4,,,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa


In [None]:
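# Assign the filled column back; fillna(inplace = True) on df.sw may not update df in recent pandas versions
df['sw'] = df['sw'].fillna(df['sw'].mean())
df['pl'] = df['pl'].fillna(df['pl'].mean())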
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,3.043151,3.821233,0.2,Iris-setosa
3,5.4,3.043151,3.821233,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa


**Note**: Since all the NaN values belonged to 'Iris-setosa', a better fill value would have been the mean of 'sw' (and likewise 'pl') computed only over the rows where the flower type is Iris-setosa.

In [None]:
df.iloc[2:4, 1:3] = np.nan
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,,,0.2,Iris-setosa
3,5.4,,,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa


In [None]:
df_setosa = df[df.flower_type == 'Iris-setosa']
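# Assign the filled column back; fillna(inplace = True) on df.sw may not update df in recent pandas versions
df['sw'] = df['sw'].fillna(df_setosa['sw'].mean())
df['pl'] = df['pl'].fillna(df_setosa['pl'].mean())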
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,3.415217,1.463043,0.2,Iris-setosa
3,5.4,3.415217,1.463043,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa
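When NaN values are spread over several flower types, the per-class mean can be computed for every row at once with groupby() and transform(). This is a sketch of that alternative on a made-up six-row frame, not the approach used in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing "sw" value in each class.
demo = pd.DataFrame(
    {
        "sw": [3.5, np.nan, 3.1, 3.0, np.nan, 2.6],
        "flower_type": ["Iris-setosa"] * 3 + ["Iris-virginica"] * 3,
    }
)

# transform("mean") returns, for each row, the mean of that row's class
# (NaN values are skipped when the mean is computed).
class_means = demo.groupby("flower_type")["sw"].transform("mean")

# Fill each gap with the mean of its own class.
demo["sw"] = demo["sw"].fillna(class_means)

print(demo)
```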


##**Duplicate Labels**

Index objects are not required to be unique; you can have duplicate row or column labels. 

But one of pandas’ roles is to clean messy, real-world data before it goes to some downstream system. And real-world data has duplicates, even in fields that are supposed to be unique.

Let's see how duplicate labels change the behavior of certain operations, and how to detect them when they arise.

###**Consequences of Duplicate Labels**

Some pandas methods (Series.reindex() for example) just don’t work with duplicates present. The output can’t be determined, and so pandas raises an error.

Other methods, like indexing, can give very surprising results. Typically indexing with a scalar will reduce dimensionality. Slicing a DataFrame with a scalar will return a Series. Slicing a Series with a scalar will return a scalar. But with duplicates, this isn’t the case.

In [None]:
df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"])
df1

Unnamed: 0,A,A,B
0,0,1,2
1,3,4,5


We have duplicates in the columns. If we slice 'B', we get back a Series

In [None]:
print(df1["B"])  # a series
type(df1["B"])

0    2
1    5
Name: B, dtype: int64


pandas.core.series.Series

But slicing 'A' returns a DataFrame

In [None]:
print(df1["A"]) # a DataFrame
type(df1["A"])  

   A  A
0  0  1
1  3  4


pandas.core.frame.DataFrame

This applies to row labels as well.

In [None]:
df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])
df2

Unnamed: 0,A
a,0
a,1
b,2


In [None]:
df2.loc["b", "A"]  # a scalar

2

In [None]:
df2.loc["a", "A"]  # a Series

a    0
a    1
Name: A, dtype: int64

###**Duplicate Label Detection**

You can check whether an Index (storing the row or column labels) is unique with **Index.is_unique**:

In [None]:
df2

Unnamed: 0,A
a,0
a,1
b,2


In [None]:
df2.index.is_unique

False

In [None]:
df2.columns.is_unique

True

**Index.duplicated()** will return a boolean ndarray indicating whether a label is repeated.

In [None]:
df2.index.duplicated()

array([False,  True, False])
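The boolean array returned by Index.duplicated() can be inverted and used as a filter to drop the repeated rows. The sketch below rebuilds the same three-row frame under the hypothetical name `df_dup`; the `keep` parameter controls which occurrence survives.

```python
import pandas as pd

df_dup = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# ~ inverts the mask, so only the first occurrence of each label is kept.
first_only = df_dup[~df_dup.index.duplicated()]

# keep="last" marks earlier occurrences as duplicates instead.
last_only = df_dup[~df_dup.index.duplicated(keep="last")]

print(first_only)
print(last_only)
```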

##**Handling Strings in Data**

Most algorithms operate on numerical data. String data is hard to use in quantitative computations.

It won't make sense to ignore string data, though. For example, if a dataset is meant to evaluate shopping habits and has a gender column with the categories 'male' and 'female', we cannot simply drop that column, as shopping habits may differ between the groups.

So, to handle such cases, we convert the string data to numerical data.

In [None]:
df

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.500000,1.400000,0.2,Iris-setosa
1,4.9,3.000000,1.400000,0.2,Iris-setosa
2,5.0,3.415217,1.463043,0.2,Iris-setosa
3,5.4,3.415217,1.463043,0.4,Iris-setosa
4,4.6,3.400000,1.400000,0.3,Iris-setosa
...,...,...,...,...,...
143,6.7,3.000000,5.200000,2.3,Iris-virginica
144,6.3,2.500000,5.000000,1.9,Iris-virginica
145,6.5,3.000000,5.200000,2.0,Iris-virginica
146,6.2,3.400000,5.400000,2.3,Iris-virginica


Let's create a dummy column to understand the process.

In [None]:
df['Gender'] = 'Female'
df.iloc[0:10, 5] = 'Male'
df

Unnamed: 0,sl,sw,pl,pw,flower_type,Gender
0,5.1,3.500000,1.400000,0.2,Iris-setosa,Male
1,4.9,3.000000,1.400000,0.2,Iris-setosa,Male
2,5.0,3.415217,1.463043,0.2,Iris-setosa,Male
3,5.4,3.415217,1.463043,0.4,Iris-setosa,Male
4,4.6,3.400000,1.400000,0.3,Iris-setosa,Male
...,...,...,...,...,...,...
143,6.7,3.000000,5.200000,2.3,Iris-virginica,Female
144,6.3,2.500000,5.000000,1.9,Iris-virginica,Female
145,6.5,3.000000,5.200000,2.0,Iris-virginica,Female
146,6.2,3.400000,5.400000,2.3,Iris-virginica,Female


In [None]:
# Encode the gender label as a number: 'Male' -> 0, anything else -> 1
def func(s):
  if s == 'Male':
    return 0
  else:
    return 1

df['Sex'] = df.Gender.apply(func)
del df['Gender']
df

Unnamed: 0,sl,sw,pl,pw,flower_type,Sex
0,5.1,3.500000,1.400000,0.2,Iris-setosa,0
1,4.9,3.000000,1.400000,0.2,Iris-setosa,0
2,5.0,3.415217,1.463043,0.2,Iris-setosa,0
3,5.4,3.415217,1.463043,0.4,Iris-setosa,0
4,4.6,3.400000,1.400000,0.3,Iris-setosa,0
...,...,...,...,...,...,...
143,6.7,3.000000,5.200000,2.3,Iris-virginica,1
144,6.3,2.500000,5.000000,1.9,Iris-virginica,1
145,6.5,3.000000,5.200000,2.0,Iris-virginica,1
146,6.2,3.400000,5.400000,2.3,Iris-virginica,1


Now, we may apply algorithms which take into consideration the 'Sex' column too.
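For a simple one-to-one encoding like this, Series.map() with a dictionary is a compact alternative to writing a function. A minimal sketch on a made-up frame; note that, unlike the function above, map() turns any value missing from the dictionary into NaN rather than 1.

```python
import pandas as pd

# Hypothetical frame; "Other" is not covered by the mapping below.
demo = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Other"]})

codes = {"Male": 0, "Female": 1}
demo["Sex"] = demo["Gender"].map(codes)

print(demo)
```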