## EXPLORING DATAFRAME

### DataFrame.shape

#### The Iris dataset is loaded into the variable iris_df. Before diving into the data, it is useful to know how many datapoints we have and the overall size of the dataset, so that we understand the volume of data we are dealing with.

In [1]:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])

In [2]:
iris_df.shape

(150, 4)

#### From the above, we are dealing with 150 rows and 4 columns of data. Each row represents one datapoint and each column represents a single feature of the dataset. Therefore, there are 150 datapoints containing 4 features each.
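
#### Since shape is an ordinary (rows, columns) tuple, it can be unpacked directly. The following is a minimal sketch on a hypothetical toy DataFrame (toy_df is not part of the Iris data), assuming only that pandas is installed.

```python
import pandas as pd

# Hypothetical toy frame standing in for iris_df: 3 rows, 2 columns.
toy_df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# shape is a plain tuple, so it unpacks into two integers.
n_rows, n_cols = toy_df.shape

# len() counts rows only; size counts every cell (rows * columns).
row_count = len(toy_df)
cell_count = toy_df.size
```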

### DataFrame.columns

#### The columns attribute returns the names of the columns and nothing else; for iris_df there are four (4), as shown below. It is most useful when we want to identify the features a dataset contains.

In [3]:
iris_df.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')
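
#### The column names can also be used programmatically, for example to check whether a feature is present. This is a small sketch on a hypothetical two-column frame (toy_df and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical frame with two of the Iris column names.
toy_df = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9],
    "sepal width (cm)": [3.5, 3.0],
})

# Convert the Index object to a plain Python list of names.
column_names = toy_df.columns.tolist()

# Membership tests work directly on the Index.
has_width = "sepal width (cm)" in toy_df.columns
has_petal = "petal length (cm)" in toy_df.columns
```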

### DataFrame.info

#### The info method prints a concise summary of the DataFrame. From its output below, we can make some observations:
#### i. The data type of each column: In this dataset, all of the data is stored as 64-bit floating-point numbers.
#### ii. The number of non-null values: Dealing with null values is an important step in data preparation. It will be dealt with later in the notebook.

In [4]:
iris_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB
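
#### info only prints its summary. If the same facts are needed as values, the dtypes attribute and the count method are one way to get them. The sketch below uses a hypothetical frame (toy_df) with one missing value so that the counts differ between columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "x" has one missing value.
toy_df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, 6.0]})

# Data type of each column, as a Series indexed by column name.
column_dtypes = toy_df.dtypes

# Number of non-null entries in each column.
non_null_counts = toy_df.count()
```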


### DataFrame.describe

#### The output below shows the number of data points (count), mean, standard deviation, minimum, lower quartile (25%), median (50%), upper quartile (75%) and maximum value of each column.

In [5]:
iris_df.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| --- | --- | --- | --- | --- |
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
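
#### Each row of the describe output corresponds to a statistic that can also be computed on its own. The sketch below checks this on a hypothetical five-value Series (values is made up for illustration); note that std here is the sample standard deviation.

```python
import pandas as pd

# Hypothetical data, chosen so the statistics are easy to verify by hand.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

summary = values.describe()

# The same statistics, computed individually.
mean_value = values.mean()             # 3.0
lower_quartile = values.quantile(0.25) # 2.0
median_value = values.median()         # 3.0
sample_std = values.std()              # sample standard deviation (ddof=1)
```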


### DataFrame.head

#### From the output, we can see five (5) entries of the dataset. Looking at the index on the left, we find that these are the first five rows.

In [6]:
iris_df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


### DataFrame.tail

#### Another way of looking at the data is from the end (instead of the beginning). The flip side of DataFrame.head is DataFrame.tail, which returns the last five rows of a DataFrame:

In [7]:
iris_df.tail()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| --- | --- | --- | --- | --- |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 |
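
#### Five rows is only the default. Both head and tail accept the number of rows to return, as in this short sketch on a hypothetical ten-row frame (toy_df is made up for illustration).

```python
import pandas as pd

# Hypothetical frame with ten rows, indexed 0 to 9.
toy_df = pd.DataFrame({"value": range(10)})

first_three = toy_df.head(3)  # rows with index 0, 1, 2
last_two = toy_df.tail(2)     # rows with index 8, 9
```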


## Missing Data

### None: non-float missing data

##### Because None comes from Python, it cannot be used in NumPy and pandas arrays that are not of data type 'object'. Remember, NumPy arrays (and the data structures in pandas) can contain only one type of data. This is what gives them their tremendous power for large-scale data and computational work, but it also limits their flexibility. Such arrays have to upcast to the “lowest common denominator,” the data type that will encompass everything in the array. When None is in the array, it means you are working with Python objects.

##### To see this in action, consider the following example array (note the dtype for it):

In [8]:
import numpy as np

example1 = np.array([2, None, 6, 8])
example1

array([2, None, 6, 8], dtype=object)
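
##### One practical consequence of the object dtype is that NumPy falls back to Python-level operations, so aggregations over an array containing None fail. The sketch below illustrates this and, for contrast, shows the same values with np.nan in place of None. The variable names are chosen for illustration only.

```python
import numpy as np

# None forces the generic Python-object dtype.
object_array = np.array([2, None, 6, 8])

# np.nan is a float, so the array stays a native float64 array.
float_array = np.array([2, np.nan, 6, 8])

# Adding an int and None is undefined in Python, so the sum raises.
try:
    object_array.sum()
    sum_failed = False
except TypeError:
    sum_failed = True

# NaN propagates through ordinary arithmetic...
nan_sum = float_array.sum()
# ...unless a NaN-aware aggregation is used.
safe_sum = np.nansum(float_array)  # 16.0
```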

#### NaN and None: null values in pandas

##### Even though NaN and None can behave somewhat differently, pandas is nevertheless built to handle them interchangeably. To see what we mean, consider a Series of integers:

In [9]:
int_series = pd.Series([1, 2, 3], dtype=int)
int_series

0    1
1    2
2    3
dtype: int32
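
##### To illustrate the interchangeability, here is a minimal sketch (the Series names are hypothetical): when a Series is built from integers plus a missing value, pandas stores the result as floating point and represents the missing entry as NaN, whether it was written as None or as np.nan.

```python
import numpy as np
import pandas as pd

# The same data, with the missing entry written two different ways.
with_none = pd.Series([1, None, 3])
with_nan = pd.Series([1, np.nan, 3])

# Both are upcast to float64, and None has been converted to NaN.
none_dtype = str(with_none.dtype)
missing_mask = with_none.isnull().tolist()

# equals() treats NaN values in the same position as equal.
same_result = with_none.equals(with_nan)
```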

### Detecting null values

In [10]:
example3 = pd.Series([0, np.nan, '', None])

In [11]:
example3.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [12]:
# If we want the total number of missing values, we can just do a sum over the mask produced by the isnull() method.

example3.isnull().sum()

2
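
##### The same idea extends to a DataFrame, where summing the mask gives one count per column. A small sketch on a hypothetical frame (toy_df is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in both columns.
toy_df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, None]})

# Summing the Boolean mask column by column counts the nulls per column.
nulls_per_column = toy_df.isnull().sum()

# Summing again gives the total for the whole DataFrame.
total_nulls = int(nulls_per_column.sum())
```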

### Dropping null values

In [13]:
example3 = example3.dropna()
example3

0    0
2     
dtype: object

In [14]:
example4 = pd.DataFrame([[1,      np.nan, 7], 
                         [2,      5,      8], 
                         [np.nan, 6,      9]])
example4

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 1.0 | NaN | 7 |
| 1 | 2.0 | 5.0 | 8 |
| 2 | NaN | 6.0 | 9 |
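
##### On a DataFrame, dropna removes whole rows by default, and whole columns when asked to. The sketch below rebuilds the same values under a hypothetical name (frame) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Same values as example4, rebuilt here so the sketch is self-contained.
frame = pd.DataFrame([[1,      np.nan, 7],
                      [2,      5,      8],
                      [np.nan, 6,      9]])

# Default: drop every row that contains at least one null value.
rows_dropped = frame.dropna()

# axis="columns": drop every column that contains at least one null value.
columns_dropped = frame.dropna(axis="columns")
```

##### Only row 1 and column 2 are free of nulls, so a lot of good data is discarded either way. This is one reason to consider filling null values instead.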


### Filling null values

##### Categorical Data (Non-numeric)

In [15]:
fill_with_mode = pd.DataFrame([[1,2,"True"],
                               [3,4,None],
                               [5,6,"False"],
                               [7,8,"True"],
                               [9,10,"True"]])

fill_with_mode

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 1 | 2 | True |
| 1 | 3 | 4 | None |
| 2 | 5 | 6 | False |
| 3 | 7 | 8 | True |
| 4 | 9 | 10 | True |


In [16]:
# Now, let's first find the mode before filling the None value with the mode.

fill_with_mode[2].value_counts()

True     3
False    1
Name: 2, dtype: int64

In [17]:
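# So, we will replace None with 'True'.
# Assign the result back to the column: calling fillna(..., inplace=True) on a
# selected column may act on a temporary copy and leave the DataFrame unchanged.

fill_with_mode[2] = fill_with_mode[2].fillna('True')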

In [18]:
fill_with_mode

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 1 | 2 | True |
| 1 | 3 | 4 | True |
| 2 | 5 | 6 | False |
| 3 | 7 | 8 | True |
| 4 | 9 | 10 | True |
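
##### Instead of reading the mode off the value counts by eye, it can be computed with the mode method. This is a minimal sketch on a hypothetical Series (answers) holding the same values as column 2.

```python
import pandas as pd

# Hypothetical Series with the same values as column 2 above.
answers = pd.Series(["True", None, "False", "True", "True"])

# mode() ignores missing values and returns a Series, because several
# values can tie for most frequent; [0] takes the first one.
most_common = answers.mode()[0]

# Fill the missing entry with the mode.
filled = answers.fillna(most_common)
```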


### Numeric Data

##### Now, coming to numeric data. Here, we have two common ways of replacing missing values:

##### i. Replace with the median of the column
##### ii. Replace with the mean of the column
##### We replace with the median when the data is skewed or contains outliers, because the median is robust to outliers.

##### When the data is roughly symmetric (for example, normally distributed), we can use the mean, since in that case the mean and median are close to each other.

##### First, let us take a column whose values are symmetrically distributed and fill the missing value with the mean of the column.

In [19]:
fill_with_mean = pd.DataFrame([[-2,0,1],
                               [-1,2,3],
                               [np.nan,4,5],
                               [1,6,7],
                               [2,8,9]])

fill_with_mean

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | -2.0 | 0 | 1 |
| 1 | -1.0 | 2 | 3 |
| 2 | NaN | 4 | 5 |
| 3 | 1.0 | 6 | 7 |
| 4 | 2.0 | 8 | 9 |


In [20]:
# The mean of the column is

np.mean(fill_with_mean[0])

0.0

In [21]:
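# Filling with mean.
# Assign the result back to the column: calling fillna(..., inplace=True) on a
# selected column may act on a temporary copy and leave the DataFrame unchanged.

fill_with_mean[0] = fill_with_mean[0].fillna(fill_with_mean[0].mean())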
fill_with_mean

| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | -2.0 | 0 | 1 |
| 1 | -1.0 | 2 | 3 |
| 2 | 0.0 | 4 | 5 |
| 3 | 1.0 | 6 | 7 |
| 4 | 2.0 | 8 | 9 |
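
##### For the skewed case, the median is the safer fill value. The sketch below uses a hypothetical column (skewed) with one large outlier to show how far the mean is pulled away from the typical values while the median is not.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column: 100.0 is an outlier.
skewed = pd.Series([1.0, 2.0, np.nan, 3.0, 100.0])

# Both statistics skip the missing value by default.
column_mean = skewed.mean()      # 26.5, dragged upward by the outlier
column_median = skewed.median()  # 2.5, close to the typical values

# Fill the missing entry with the median.
filled = skewed.fillna(column_median)
```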


### Encoding Categorical Data

##### Machine learning models work only with numeric data. A model cannot tell the difference between a Yes and a No, but it can distinguish between 0 and 1. So, after filling in the missing values, we need to encode the categorical data in some numeric form for the model to understand.

#### LABEL ENCODING

##### Label encoding converts each category to a number. For example, say we have a dataset of airline passengers with a column containing their class, which is one of ['business class', 'economy class', 'first class']. Label encoding transforms these to [0, 1, 2]. Let us see an example in code. As we will be learning scikit-learn in the upcoming notebooks, we won't use it here.

In [22]:
label = pd.DataFrame([
                      [10,'business class'],
                      [20,'first class'],
                      [30, 'economy class'],
                      [40, 'economy class'],
                      [50, 'economy class'],
                      [60, 'business class']
],columns=['ID','class'])
label

| | ID | class |
| --- | --- | --- |
| 0 | 10 | business class |
| 1 | 20 | first class |
| 2 | 30 | economy class |
| 3 | 40 | economy class |
| 4 | 50 | economy class |
| 5 | 60 | business class |


In [23]:
# To perform label encoding on the 'class' column, we first define a mapping from each class to a number and then apply it.
# map() looks up every value in the dictionary and returns an integer column.

class_labels = {'business class':0,'economy class':1,'first class':2}
label['class'] = label['class'].map(class_labels)
label

| | ID | class |
| --- | --- | --- |
| 0 | 10 | 0 |
| 1 | 20 | 2 |
| 2 | 30 | 1 |
| 3 | 40 | 1 |
| 4 | 50 | 1 |
| 5 | 60 | 0 |
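
##### Writing the mapping by hand becomes tedious when there are many categories. One alternative, sketched here on a hypothetical Series (classes), is to let pandas assign the codes through its categorical dtype. It numbers the categories in sorted order, which for these values happens to match the manual mapping.

```python
import pandas as pd

# Hypothetical Series with the same values as the 'class' column.
classes = pd.Series(["business class", "first class", "economy class",
                     "economy class", "economy class", "business class"])

# Converting to the categorical dtype sorts the distinct values...
as_category = classes.astype("category")
category_order = as_category.cat.categories.tolist()

# ...and cat.codes gives each row the position of its category.
codes = as_category.cat.codes.tolist()
```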


#### Removing duplicate data

##### In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, pandas provides an easy means of detecting and removing duplicate entries.

#### Identifying duplicates: duplicated

##### You can easily spot duplicate values using the duplicated method in pandas, which returns a Boolean mask indicating whether an entry in a DataFrame is a duplicate of an earlier one. Let's create another example DataFrame to see this in action.

In [24]:
example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],
                         'numbers': [1, 2, 1, 3, 3]})
example6

| | letters | numbers |
| --- | --- | --- |
| 0 | A | 1 |
| 1 | B | 2 |
| 2 | A | 1 |
| 3 | B | 3 |
| 4 | B | 3 |


In [25]:
example6.duplicated()

0    False
1    False
2     True
3    False
4     True
dtype: bool
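
##### By default only the later copies are flagged. The keep parameter changes which copies count as duplicates, as in this sketch. It rebuilds the same values under a hypothetical name (frame) so that it runs on its own.

```python
import pandas as pd

# Same values as example6, rebuilt here so the sketch is self-contained.
frame = pd.DataFrame({"letters": ["A", "B", "A", "B", "B"],
                      "numbers": [1, 2, 1, 3, 3]})

# keep="last": the last occurrence is kept, earlier copies are flagged.
flag_earlier = frame.duplicated(keep="last").tolist()

# keep=False: every row that has a copy anywhere is flagged.
flag_all = frame.duplicated(keep=False).tolist()
```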

#### Dropping duplicates: drop_duplicates

##### drop_duplicates simply returns a copy of the data for which all of the duplicated values are False:

In [26]:
example6.drop_duplicates()

| | letters | numbers |
| --- | --- | --- |
| 0 | A | 1 |
| 1 | B | 2 |
| 3 | B | 3 |


In [27]:
# Both duplicated and drop_duplicates consider all columns by default, but you can specify that they examine only a subset of columns in your DataFrame:

example6.drop_duplicates(['letters'])

| | letters | numbers |
| --- | --- | --- |
| 0 | A | 1 |
| 1 | B | 2 |
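
##### The subset can also be combined with the keep parameter to decide which of the matching rows survives. A minimal sketch, again rebuilding the values under a hypothetical name (frame):

```python
import pandas as pd

# Same values as example6, rebuilt here so the sketch is self-contained.
frame = pd.DataFrame({"letters": ["A", "B", "A", "B", "B"],
                      "numbers": [1, 2, 1, 3, 3]})

# Compare on 'letters' only, and keep the last row seen for each letter.
last_per_letter = frame.drop_duplicates(subset=["letters"], keep="last")

# The original index labels are preserved; reset them if a clean
# 0..n-1 index is wanted.
renumbered = last_per_letter.reset_index(drop=True)
```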
