# Pandas Tutorial : Day 5
Here's what we are going to learn today:
* [Drop not-null data](#1)
 1. [Drop columns](#2)
 2. [Drop a row by index](#3)
 3. [Drop columns and/or rows of MultiIndex DataFrame](#4)
* [Drop null data](#5)
 1. [Drop the rows where at least one element is missing](#6)
 2. [Drop the columns where at least one element is missing](#7)
 3. [Drop the rows where all the elements are missing](#8)
 4. [Keep only the rows with at least 2 non-NA values](#9)
 5. [Define in which columns to look for missing values](#10)
 6. [Keep the DataFrame with valid entries in the same variable](#11)
* [Convert data types](#12)
 1. [Cast all the columns to one data type](#13)
 2. [Cast single column to data type](#14)
* [apply function](#15)
 1. [Method 1](#16)
 2. [Method 2](#17)

Let's get started!

[Data for daily news for stock market prediction](https://www.kaggle.com/aaron7sun/stocknews)

In [None]:
# import library
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# import data
df = pd.read_csv('/kaggle/input/stocknews/upload_DJIA_table.csv')

In [None]:
# look at the top 5 rows in the data
df.head()

## Drop not-null Data<a id='1'></a>
Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and the corresponding axis, or by directly specifying index or column names. When using a MultiIndex, labels on different levels can be removed by specifying the level.

### 1. Drop columns<a id='2'></a>
Syntax 1: `df.drop(['column_name1', 'column_name2', ...], axis = 1)`

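`drop` returns a new DataFrame and leaves the original unchanged. To drop the column permanently, pass `inplace=True` (note the capital `T`; the default is `inplace=False`) or assign the result back to the variable.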

Syntax 2 : `df.drop(columns = ['column_name1', 'column_name2', ...])`
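
To make the two syntaxes and the copy-versus-original behaviour concrete, here is a minimal sketch. It uses a small made-up frame named `toy` (an assumption for illustration, not the stock data) so it runs on its own.

```python
import pandas as pd

# Toy data standing in for the stock table (values are made up)
toy = pd.DataFrame({"Date": ["2016-07-01", "2016-06-30"],
                    "Open": [100.0, 101.5],
                    "Close": [101.0, 100.5]})

# Both spellings return a new DataFrame without the 'Date' column
by_axis = toy.drop(["Date"], axis=1)
by_keyword = toy.drop(columns=["Date"])
print(by_axis.equals(by_keyword))  # True

# The original still has 'Date' until the result is assigned back
print(list(toy.columns))
toy = toy.drop(columns=["Date"])
print(list(toy.columns))
```

Assigning the result back is usually the easier pattern to read, since it is obvious at a glance which variable changed.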

In [None]:
# let's drop date column by syntax 1
df.drop(['Date'], axis = 1)

In [None]:
# let's drop open column by syntax 2
df.drop(columns = ['Open'])

In [None]:
# check the original DataFrame: both columns are still there
df.head()

### 2. Drop a row by index<a id='3'></a>
Syntax : `df.drop(['index'])`

In [None]:
df.drop([0, 1])

The rows labelled 0 and 1 (here, the first two rows) are dropped from the returned DataFrame; `df` itself is unchanged.
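
Keep in mind that `drop` matches index labels, not positions. The sketch below uses a hypothetical frame `prices` with a shuffled index to show the difference; the names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose index labels are not in the order 0, 1, 2, ...
prices = pd.DataFrame({"Open": [10.0, 11.0, 12.0, 13.0]},
                      index=[3, 2, 1, 0])

# drop() matches labels: this removes the rows labelled 0 and 1,
# which here are the last two rows, not the first two
by_label = prices.drop([0, 1])

# To drop by position, look up the labels at those positions first
by_position = prices.drop(prices.index[[0, 1]])

print(by_label.index.tolist())     # [3, 2]
print(by_position.index.tolist())  # [1, 0]
```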

### 3. Drop columns and/or rows of MultiIndex DataFrame<a id='4'></a>
Syntax 1: `df.drop(index = 'first_index', columns = 'column_name')`

Syntax 2 : `df.drop(index = 'second_index', level = 1)`

In [None]:
# let's make a multiIndex Dataframe
midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
                             ['speed', 'weight', 'length']],
                     codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
                            [0, 1, 2, 0, 1, 2, 0, 1, 2]])
df1 = pd.DataFrame(index=midx, columns=['big', 'small'],
                  data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
                        [250, 150], [1.5, 0.8], [320, 250],
                        [1, 0.8], [0.3, 0.2]])
df1

In [None]:
# to drop the whole 'cow' row and 'small' column
df1.drop(index='cow', columns='small')

In [None]:
# to drop the subindex 'length'
df1.drop(index='length', level=1)
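
A single row of a MultiIndex DataFrame can also be dropped by passing the full label as a tuple. This is a small sketch on a made-up frame called `animals` (built with `pd.MultiIndex.from_product`, which is an assumption of this example rather than the construction used above).

```python
import pandas as pd

# Small two-level index: every animal paired with every measurement
idx = pd.MultiIndex.from_product([["cow", "falcon"], ["speed", "weight"]])
animals = pd.DataFrame({"big": [30, 250, 320, 1],
                        "small": [20, 150, 250, 0.8]},
                       index=idx)

# A tuple names one specific (first level, second level) combination,
# so only that row is removed
trimmed = animals.drop(index=("falcon", "weight"))
print(trimmed.index.tolist())
```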

## Drop null data<a id='5'></a>
Let's see how to remove missing values from the dataset.

**Note :** We don't have any missing values in this dataset. Therefore we'll use our own data that has missing values.
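
One way to check whether a DataFrame has missing values is to count them per column with `isna().sum()`. The sketch below runs on a made-up frame `sample` (hypothetical values) so that there is something to count; the same two calls can be tried on `df`.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in 'Open'
sample = pd.DataFrame({"Open": [100.0, np.nan, 102.0],
                       "Volume": [1000, 1200, 900]})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# True if there is at least one missing value anywhere in the frame
has_missing = bool(sample.isna().any().any())
print(has_missing)
```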

In [None]:
df2 = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, "1940-04-25",
                            pd.NaT]})
df2

### 1. Drop the rows where at least one element is missing<a id='6'></a>
Syntax : `df.dropna()`

In [None]:
df2.dropna()

### 2. Drop the columns where at least one element is missing<a id='7'></a>
Syntax : `df.dropna(axis = 'columns')`

In [None]:
df2.dropna(axis = 'columns')

### 3. Drop the rows where all the elements are missing<a id='8'></a>
Syntax : `df.dropna(how = 'all')`

In [None]:
df2.dropna(how = 'all')

### 4. Keep only the rows with at least 2 non-NA values<a id='9'></a>
Syntax : `df.dropna(thresh = 2)`

In [None]:
df2.dropna(thresh = 2)
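
To see why a row survives `thresh=2`, it can help to count the non-missing values per row first. This sketch rebuilds the same data under the name `heroes` (a name chosen here for illustration) so it runs independently.

```python
import numpy as np
import pandas as pd

heroes = pd.DataFrame({"name": ["Alfred", "Batman", "Catwoman"],
                       "toy": [np.nan, "Batmobile", "Bullwhip"],
                       "born": [pd.NaT, "1940-04-25", pd.NaT]})

# Count the non-missing values in each row
non_na = heroes.notna().sum(axis=1)
print(non_na.tolist())        # [1, 3, 2]

# thresh=2 keeps rows with at least 2 non-missing values
kept = heroes.dropna(thresh=2)
print(kept["name"].tolist())  # ['Batman', 'Catwoman']
```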

### 5. Define in which columns to look for missing values<a id='10'></a>
Syntax : `df.dropna(subset = ['column_name'])`

In [None]:
df2.dropna(subset=['name', 'born'])

### 6. Keep the DataFrame with valid entries in the same variable<a id='11'></a>

In [None]:
df2.dropna(inplace = True)
df2
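
One thing to watch for: with `inplace=True` the method returns `None`, so its result should not be assigned to a variable. A minimal sketch, using made-up frames `frame` and `other`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"]})

# inplace=True changes 'frame' itself and returns None
result = frame.dropna(inplace=True)
print(result)      # None
print(len(frame))  # 1

# Equivalent alternative: assign the returned copy back to the same name
other = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"]})
other = other.dropna()
print(len(other))  # 1
```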

## Convert Data types<a id='12'></a>

In [None]:
# let's see the datatype of each column
df.dtypes

### 1. Cast all the columns to one data type<a id='13'></a>
Syntax : `df.astype('data_type').dtypes`

In [None]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df3 = pd.DataFrame(data=d)
df3.dtypes

In [None]:
df3.astype('int32').dtypes

### 2. Cast single column to data type<a id='14'></a>
Syntax : `df.astype({'column_name' : 'datatype'}).dtypes`

In [None]:
df3.astype({'col1': 'float'}).dtypes
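
Like `drop`, `astype` returns a new DataFrame, and the dictionary form can set a different type for each column. A short sketch on a made-up frame `numbers` (same shape as `df3`, rebuilt here so the example is self-contained):

```python
import pandas as pd

numbers = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# One target dtype per column; columns not listed would keep their dtype
converted = numbers.astype({"col1": "float64", "col2": "int32"})
print(converted.dtypes)

# The original frame keeps its integer columns until you assign back
print(numbers.dtypes)
```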

## Apply Function<a id='15'></a>
`Series.apply` lets you pass a function and apply it to every single value of a pandas Series. It is handy when the transformation or condition you need has no ready-made pandas method, which is why it is used so often in data science and machine learning code.

Syntax : `s.apply(func, convert_dtype=True, args=())`

Parameters:

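* **func:** The function to apply to each value of the pandas Series.
* **convert_dtype:** Try to find a better dtype for the results of the function; with `False`, the result is left as `object` dtype. This parameter is deprecated since pandas 2.1.
* **args=():** Positional arguments passed to the function after the Series value.
* **Return Type:** A pandas Series with the function applied (a DataFrame if the function itself returns a Series).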
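
The `args` parameter is easiest to understand with a small example. In this sketch, `scale` and `opens` are hypothetical names with made-up values; the extra argument `3` is passed to `scale` after each Series value.

```python
import pandas as pd

def scale(value, factor):
    # 'value' comes from the Series, 'factor' comes from args
    return value * factor

opens = pd.Series([100.0, 101.5, 99.0])

# args must be a tuple, hence the trailing comma
scaled = opens.apply(scale, args=(3,))
print(scaled.tolist())  # [300.0, 304.5, 297.0]
```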

### Method 1<a id='16'></a>

In [None]:
def method1(x):
    return x * 2
df.Open.apply(method1).head(5)

### Method 2<a id='17'></a>

In [None]:
df.Open.apply(lambda x : x * 2).head()
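
For simple arithmetic like this, a plain vectorised expression gives the same result as `apply` and is usually faster on large data. A quick comparison on a made-up Series `opens` (hypothetical values):

```python
import pandas as pd

opens = pd.Series([100.0, 101.5, 99.0], name="Open")

via_apply = opens.apply(lambda x: x * 2)  # element by element
vectorised = opens * 2                    # whole Series at once

print(via_apply.equals(vectorised))  # True
```

`apply` remains the better fit when the logic cannot be written as a column-wise expression.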

Yeahhh! That wraps up today's pandas tutorial. We have learnt a lot so far, so keep practicing!