# Pandas Tutorial : Day 5
Here is what we are going to learn today:
* [Drop not-null data](#1)
 1. [Drop columns](#2)
 2. [Drop a row by index](#3)
 3. [Drop columns and/or rows of MultiIndex DataFrame](#4)
* [Drop null data](#5)
 1. [Drop the rows where at least one element is missing](#6)
 2. [Drop the columns where at least one element is missing](#7)
 3. [Drop the rows where all the elements are missing](#8)
 4. [Keep only the rows with at least 2 non-NA values](#9)
 5. [Define in which columns to look for missing values](#10)
 6. [Keep the DataFrame with valid entries in the same variable](#11)
* [Convert data types](#12)
 1. [Cast all the columns to one data type](#13)
 2. [Cast single column to data type](#14)
* [apply function](#15)
 1. [Method 1](#16)
 2. [Method 2](#17)

Let's get started!

[Data for daily news for stock market prediction](https://www.kaggle.com/aaron7sun/stocknews)

In [1]:
# import library
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/stocknews/upload_DJIA_table.csv
/kaggle/input/stocknews/Combined_News_DJIA.csv
/kaggle/input/stocknews/RedditNews.csv


In [2]:
# import data
df = pd.read_csv('/kaggle/input/stocknews/upload_DJIA_table.csv')

In [3]:
# look at the top 5 rows in the data
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141
1,2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234
2,2016-06-29,17456.019531,17704.509766,17456.019531,17694.679688,106380000,17694.679688
3,2016-06-28,17190.509766,17409.720703,17190.509766,17409.720703,112190000,17409.720703
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234


## Drop not-null Data<a id='1'></a>
Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and the corresponding axis, or by directly specifying index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

### 1. Drop columns<a id='2'></a>
Syntax 1: `df.drop(['column_name1', 'column_name2', ...], axis = 1)`

`df.drop` returns a new DataFrame and leaves `df` unchanged. To drop the column from `df` itself, pass `inplace = True` (the default is `inplace = False`) or assign the result back to `df`.

Syntax 2 : `df.drop(columns = ['column_name1', 'column_name2', ...])`

In [4]:
# let's drop date column by syntax 1
df.drop(['Date'], axis = 1)

Unnamed: 0,Open,High,Low,Close,Volume,Adj Close
0,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141
1,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234
2,17456.019531,17704.509766,17456.019531,17694.679688,106380000,17694.679688
3,17190.509766,17409.720703,17190.509766,17409.720703,112190000,17409.720703
4,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234
...,...,...,...,...,...,...
1984,11532.070312,11718.280273,11450.889648,11615.929688,159790000,11615.929688
1985,11632.809570,11633.780273,11453.339844,11532.959961,182550000,11532.959961
1986,11781.700195,11782.349609,11601.519531,11642.469727,173590000,11642.469727
1987,11729.669922,11867.110352,11675.530273,11782.349609,183190000,11782.349609


In [5]:
# let's drop open column by syntax 2
df.drop(columns = ['Open'])

Unnamed: 0,Date,High,Low,Close,Volume,Adj Close
0,2016-07-01,18002.380859,17916.910156,17949.369141,82160000,17949.369141
1,2016-06-30,17930.609375,17711.800781,17929.990234,133030000,17929.990234
2,2016-06-29,17704.509766,17456.019531,17694.679688,106380000,17694.679688
3,2016-06-28,17409.720703,17190.509766,17409.720703,112190000,17409.720703
4,2016-06-27,17355.210938,17063.080078,17140.240234,138740000,17140.240234
...,...,...,...,...,...,...
1984,2008-08-14,11718.280273,11450.889648,11615.929688,159790000,11615.929688
1985,2008-08-13,11633.780273,11453.339844,11532.959961,182550000,11532.959961
1986,2008-08-12,11782.349609,11601.519531,11642.469727,173590000,11642.469727
1987,2008-08-11,11867.110352,11675.530273,11782.349609,183190000,11782.349609


In [6]:
# check the DataFrame: it is unchanged, because drop() returned new DataFrames
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141
1,2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234
2,2016-06-29,17456.019531,17704.509766,17456.019531,17694.679688,106380000,17694.679688
3,2016-06-28,17190.509766,17409.720703,17190.509766,17409.720703,112190000,17409.720703
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234
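To make the difference between a temporary and a permanent drop concrete, here is a minimal sketch. The `prices` frame is a hypothetical stand-in with a few columns named like the DJIA table; it is not the tutorial's dataset.

```python
import pandas as pd

# Hypothetical stand-in frame; column names mirror the DJIA table above.
prices = pd.DataFrame({
    "Date": ["2016-07-01", "2016-06-30"],
    "Open": [17924.24, 17712.76],
    "Close": [17949.37, 17929.99],
})

# drop() returns a new DataFrame; 'prices' still has its 'Date' column.
without_date = prices.drop(columns=["Date"])
print(list(prices.columns))
print(list(without_date.columns))

# Option 1: keep the change by assigning the result back.
prices = prices.drop(columns=["Date"])

# Option 2: modify an existing frame in place; the call itself returns None.
in_place = without_date.copy()
result = in_place.drop(columns=["Open"], inplace=True)
print(result, list(in_place.columns))
```

Assigning the result back is usually the easier pattern to read, since `inplace = True` returns `None` and cannot be chained.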


### 2. Drop a row by index<a id='3'></a>
Syntax : `df.drop([index_label1, index_label2, ...])`

In [7]:
df.drop([0, 1])

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
2,2016-06-29,17456.019531,17704.509766,17456.019531,17694.679688,106380000,17694.679688
3,2016-06-28,17190.509766,17409.720703,17190.509766,17409.720703,112190000,17409.720703
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234
5,2016-06-24,17946.630859,17946.630859,17356.339844,17400.750000,239000000,17400.750000
6,2016-06-23,17844.109375,18011.070312,17844.109375,18011.070312,98070000,18011.070312
...,...,...,...,...,...,...,...
1984,2008-08-14,11532.070312,11718.280273,11450.889648,11615.929688,159790000,11615.929688
1985,2008-08-13,11632.809570,11633.780273,11453.339844,11532.959961,182550000,11532.959961
1986,2008-08-12,11781.700195,11782.349609,11601.519531,11642.469727,173590000,11642.469727
1987,2008-08-11,11729.669922,11867.110352,11675.530273,11782.349609,183190000,11782.349609


The first two rows are dropped.
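Note that `drop` matches index labels, not positions. The two only coincide here because the DataFrame has the default 0, 1, 2, ... index. The sketch below uses a small hypothetical frame with a non-default index to show the difference, along with the `errors` parameter for labels that may not exist.

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0..n-1.
frame = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0]},
                     index=[10, 20, 30, 40])

# drop() matches labels: this removes the rows labelled 10 and 20.
by_label = frame.drop([10, 20])

# To drop by position, look up the labels at those positions first.
by_position = frame.drop(frame.index[[0, 1]])

# A missing label raises KeyError unless errors="ignore" is passed;
# here only the existing label 10 is dropped.
tolerant = frame.drop([10, 99], errors="ignore")

print(by_label.index.tolist())
print(tolerant.index.tolist())
```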

### 3. Drop columns and/or rows of MultiIndex DataFrame<a id='4'></a>
Syntax 1: `df.drop(index = 'first_index', columns = 'column_name')`

Syntax 2 : `df.drop(index = 'second_index', level = 1)`

In [8]:
# let's make a multiIndex Dataframe
midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
                             ['speed', 'weight', 'length']],
                     codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
                            [0, 1, 2, 0, 1, 2, 0, 1, 2]])
df1 = pd.DataFrame(index=midx, columns=['big', 'small'],
                  data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
                        [250, 150], [1.5, 0.8], [320, 250],
                        [1, 0.8], [0.3, 0.2]])
df1

Unnamed: 0,Unnamed: 1,big,small
lama,speed,45.0,30.0
lama,weight,200.0,100.0
lama,length,1.5,1.0
cow,speed,30.0,20.0
cow,weight,250.0,150.0
cow,length,1.5,0.8
falcon,speed,320.0,250.0
falcon,weight,1.0,0.8
falcon,length,0.3,0.2


In [9]:
# to drop all the 'cow' rows and the 'small' column
df1.drop(index='cow', columns='small')

Unnamed: 0,Unnamed: 1,big
lama,speed,45.0
lama,weight,200.0
lama,length,1.5
falcon,speed,320.0
falcon,weight,1.0
falcon,length,0.3


In [10]:
# to drop the subindex 'length'
df1.drop(index='length', level=1)

Unnamed: 0,Unnamed: 1,big,small
lama,speed,45.0,30.0
lama,weight,200.0,100.0
cow,speed,30.0,20.0
cow,weight,250.0,150.0
falcon,speed,320.0,250.0
falcon,weight,1.0,0.8
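A single combination of levels can also be dropped by passing a tuple as the index label. The sketch below rebuilds the same data under the hypothetical name `animals`, using `pd.MultiIndex.from_product` as a shorter way to create the index.

```python
import pandas as pd

# Same labels as df1 above, built with from_product for brevity.
midx = pd.MultiIndex.from_product(
    [["lama", "cow", "falcon"], ["speed", "weight", "length"]]
)
animals = pd.DataFrame(
    {"big": [45, 200, 1.5, 30, 250, 1.5, 320, 1, 0.3],
     "small": [30, 100, 1, 20, 150, 0.8, 250, 0.8, 0.2]},
    index=midx,
)

# A tuple names one (level 0, level 1) combination, so only that row goes.
trimmed = animals.drop(index=("falcon", "weight"))
print(trimmed)
```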


## Drop null data<a id='5'></a>
Let's see how to remove missing values from the dataset.

**Note :** We don't have any missing values in this dataset. Therefore we'll use our own data that has missing values.

In [11]:
df2 = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, "1940-04-25",
                            pd.NaT]})
df2

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT


### 1. Drop the rows where at least one element is missing<a id='6'></a>
Syntax : `df.dropna()`

In [12]:
df2.dropna()

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25


### 2. Drop the columns where at least one element is missing<a id='7'></a>
Syntax : `df.dropna(axis = 'columns')`

In [13]:
df2.dropna(axis = 'columns')

Unnamed: 0,name
0,Alfred
1,Batman
2,Catwoman


### 3. Drop the rows where all the elements are missing<a id='8'></a>
Syntax : `df.dropna(how = 'all')`

In [14]:
df2.dropna(how = 'all')

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT
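Nothing is dropped above because no row in `df2` is missing every value. The sketch below uses a hypothetical frame, `people`, in which the middle row is entirely missing, to show `how = 'all'` actually removing a row.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 has no valid values at all.
people = pd.DataFrame({
    "name": ["Alfred", np.nan, "Catwoman"],
    "toy": [np.nan, np.nan, "Bullwhip"],
})

# Row 0 survives (it still has a name); only the all-missing row 1 is dropped.
kept = people.dropna(how="all")
print(kept)
```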


### 4. Keep only the rows with at least 2 non-NA values<a id='9'></a>
Syntax : `df.dropna(thresh = 2)`

In [15]:
df2.dropna(thresh = 2)

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT
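`thresh` counts the valid (non-NA) values in each row, not the missing ones. As a rough check, the sketch below recreates the same data under the hypothetical name `heroes` and compares the result with a manual count made with `notna()`.

```python
import numpy as np
import pandas as pd

# Same data as df2 above, under a hypothetical name.
heroes = pd.DataFrame({
    "name": ["Alfred", "Batman", "Catwoman"],
    "toy": [np.nan, "Batmobile", "Bullwhip"],
    "born": [pd.NaT, "1940-04-25", pd.NaT],
})

# Number of non-NA values per row.
non_na_counts = heroes.notna().sum(axis=1)
print(non_na_counts.tolist())

# thresh=2 keeps exactly the rows whose count is 2 or more.
kept = heroes.dropna(thresh=2)
manual = heroes[non_na_counts >= 2]
print(kept.index.tolist())
```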


### 5. Define in which columns to look for missing values<a id='10'></a>
Syntax : `df.dropna(subset = ['column_name'])`

In [16]:
df2.dropna(subset=['name', 'born'])

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25


### 6. Keep the DataFrame with valid entries in the same variable<a id='11'></a>

In [17]:
df2.dropna(inplace = True)
df2

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25


## Convert Data types<a id='12'></a>

In [18]:
# let's see the datatype of each column
df.dtypes

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Volume         int64
Adj Close    float64
dtype: object
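The `Date` column is stored as `object` (plain strings). One common way to turn such a column into real dates is `pd.to_datetime`. The sketch below uses a hypothetical two-row frame, `sample`, and assumes the `YYYY-MM-DD` format seen in the table above.

```python
import pandas as pd

# Hypothetical stand-in with the same date format as the DJIA table.
sample = pd.DataFrame({
    "Date": ["2016-07-01", "2016-06-30"],
    "Volume": [82160000, 133030000],
})

# Parse the strings into datetimes and store them back in the column.
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")
print(sample.dtypes)

# Datetime columns expose date parts through the .dt accessor.
print(sample["Date"].dt.year.tolist())
```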

### 1. Cast all the columns to one data type<a id='13'></a>
Syntax : `df.astype('data_type').dtypes`

In [19]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df3 = pd.DataFrame(data=d)
df3.dtypes

col1    int64
col2    int64
dtype: object

In [20]:
df3.astype('int32').dtypes

col1    int32
col2    int32
dtype: object

### 2. Cast single column to data type<a id='14'></a>
Syntax : `df.astype({'column_name' : 'datatype'}).dtypes`

In [21]:
df3.astype({'col1': 'float'}).dtypes

col1    float64
col2      int64
dtype: object
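Like `drop`, `astype` returns a new object, so the result has to be assigned back to keep it. `astype` also raises an error when a value cannot be converted. A minimal sketch with a hypothetical frame, `raw`, showing the assignment and `pd.to_numeric` as a more forgiving alternative:

```python
import pandas as pd

# Hypothetical frame of numbers stored as strings; 'five' is not numeric.
raw = pd.DataFrame({"col1": ["1", "2", "3"],
                    "col2": ["4", "five", "6"]})

# Assign the converted column back to keep the new dtype.
raw["col1"] = raw["col1"].astype("int64")

# astype("int64") would raise ValueError on 'five';
# to_numeric with errors="coerce" turns it into NaN instead.
raw["col2"] = pd.to_numeric(raw["col2"], errors="coerce")

print(raw.dtypes)
print(raw)
```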

## Apply Function<a id='15'></a>
`Series.apply` lets you pass a function and apply it to every single value of a pandas Series. It is handy for transforming or labelling values according to your own conditions, which is why it shows up so often in data science and machine learning code.

Syntax : `s.apply(func, convert_dtype=True, args=())`

Parameters:

* **func:** The function to apply to each value of the Series.
* **convert_dtype:** Try to find a better dtype for the results of the function (default `True`). This parameter is deprecated since pandas 2.1.
* **args=():** Positional arguments passed to `func` after the Series value.
* **Return Type:** Pandas Series with the function applied (a DataFrame if `func` itself returns a Series).
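The `args` parameter is easiest to see with a function that takes more than one argument. In the sketch below, `scale`, `factor`, `offset` and the `opens` Series are hypothetical names chosen for illustration.

```python
import pandas as pd

def scale(value, factor, offset):
    """Multiply a single value by factor, then add offset."""
    return value * factor + offset

# Hypothetical Series standing in for a price column.
opens = pd.Series([100.0, 200.0, 300.0])

# Each value is passed as the first argument; args fills in factor and offset.
scaled = opens.apply(scale, args=(2, 1))
print(scaled.tolist())
```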

### Method 1<a id='16'></a>

In [None]:
def method1(x):
    return x * 2
df.Open.apply(method1).head(5)

### Method 2<a id='17'></a>

In [None]:
df.Open.apply(lambda x : x * 2).head()
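For simple arithmetic like this, operating on the whole column at once gives the same result without calling a Python function for every value, and it is typically faster on large columns. A small sketch with a hypothetical `opens` Series:

```python
import pandas as pd

# Hypothetical Series standing in for df.Open.
opens = pd.Series([17924.24, 17712.76, 17456.02], name="Open")

# Element by element, through a Python function.
doubled_apply = opens.apply(lambda x: x * 2)

# Whole column at once (vectorized).
doubled_vectorized = opens * 2

print(doubled_apply.equals(doubled_vectorized))
```

`apply` is still the right tool when the logic cannot be expressed with built-in column operations.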

Yeahhhh!!! That wraps up our pandas tutorial! We have learnt many things so far, so keep practicing!