# Pandas
<li>Pandas is an open-source Python package, built on top of NumPy, for working with data sets.</li> 
<li>The name "Pandas" is derived from "panel data" and is also read as <b>"Python Data Analysis".</b></li>
<li>Pandas is considered to be one of the best data-wrangling packages.</li>
<li>Pandas offers easy-to-use data structures and tools for analyzing, cleaning, exploring and manipulating data.</li>
<li>It also works well with various other Python data science modules.</li>


## Why Use Pandas?

<li>Pandas is known for its ability to represent and organize data.</li>
<li>The Pandas library was created to work with large datasets quickly and efficiently.</li>
<li>It excels at analyzing large amounts of data. Pandas allows us to analyze data and draw conclusions based on statistical theory.</li>
<li>Pandas can clean messy data sets and make them readable and relevant.</li>
<li>Built on NumPy and integrated with Matplotlib, Pandas offers users a powerful tool for performing <b>data analytics and visualization.</b></li>
<li>Data can be imported into Pandas from a variety of file formats, such as CSV, SQL, Excel, and JSON, among others.</li>
<li>Pandas is a versatile and marketable skill for data analysts and data scientists that can gain the attention of employers.</li>


## Installation Of Pandas
<li>Open your terminal, activate your virtual environment, and then use the following command to install pandas.</li>

<code>
    pip install pandas
</code>

## Importing Pandas
<li>We need to import pandas before we can create a pandas dataframe and perform any analysis on it.</li>
<li>We can import the pandas package using the following command:</li>
<code>
    import pandas as pd
</code>

In [3]:
import pandas as pd

## How To Create A Pandas DataFrame
<li>A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, arranged as a table with rows and columns.</li>
<li>We can create a basic pandas dataframe in several ways.</li>
<li>Let's discuss some of the ways to create a dataframe:</li>

### 1. From Python Dictionary

In [4]:
data = {
    "name": ["naresh","ram"],
    "age": [24,25],
    "address": ["bkt","ktm"]
}

In [5]:
df = pd.DataFrame(data)
df

Unnamed: 0,name,age,address
0,naresh,24,bkt
1,ram,25,ktm


### 2. From a list of dictionaries

In [6]:
list_dic = [
    {
        "name":"Hari",
        "age": 21
        
    },
    {
      "name":"amisha",
        "age": 20
    }
]

In [7]:
list_df = pd.DataFrame(list_dic)
list_df

Unnamed: 0,name,age
0,Hari,21
1,amisha,20


### 3. From a list of tuples

In [8]:
list_tupe=[
    ("naresh", 23,"bkt"),
    ("megamind",34,"ktm")
]

In [9]:
tup_dic = pd.DataFrame(list_tupe)
tup_dic

Unnamed: 0,0,1,2
0,naresh,23,bkt
1,megamind,34,ktm
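
When a dataframe is built from tuples (or lists), pandas has no column names to use, so it labels the columns 0, 1, 2 as in the output above. A minimal sketch of naming them with the `columns` parameter of `pd.DataFrame` follows; the sample rows and the column names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Illustrative rows; each tuple becomes one row of the dataframe.
rows = [
    ("naresh", 23, "bkt"),
    ("megamind", 34, "ktm"),
]

# Without `columns`, pandas falls back to positional labels 0, 1, 2.
unnamed_df = pd.DataFrame(rows)

# Passing `columns` gives every position a readable name.
named_df = pd.DataFrame(rows, columns=["name", "age", "address"])

print(list(unnamed_df.columns))
print(list(named_df.columns))
```

The same `columns` argument works for a list of lists.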


### 4. From list of lists

In [10]:
nested_list = [[
    "naresh",23,"ktm"
],
              ['megamind',24,"bkt"]]

In [11]:
nested_dic = pd.DataFrame(nested_list)
nested_dic

Unnamed: 0,0,1,2
0,naresh,23,ktm
1,megamind,24,bkt


# Question
1. Read the 'imports-85.data' file using Python's csv reader.
2. Store the data from the file in a list of lists.
3. Create a pandas dataframe from the list of lists.
4. For the column names, use the columns variable given below.

In [12]:
import csv
with open("data/imports-85.data") as file:
    reader = csv.reader(file)
    data_list = list(reader)


In [13]:
_data_df = pd.DataFrame(data_list)
_data_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [14]:
columns = ['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration', 'num_of_doors',
          'body_style', 'drive_wheels', 'engine_location', 'wheel_base', 'length', 'width', 
           'height', 'curb_weight', 'engine_type', 'num_of_cylinders', 'engine_size', 'fuel_system',
          'bore', 'stroke', 'compression', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 
           'price']

In [15]:
_data_df.columns = columns
_data_df

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,...,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,...,130,mpfi,3.47,2.68,9.00,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.50,...,152,mpfi,2.68,3.47,9.00,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.80,...,109,mpfi,3.19,3.40,10.00,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.40,...,136,mpfi,3.19,3.40,8.00,115,5500,18,22,17450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,volvo,gas,std,four,sedan,rwd,front,109.10,...,141,mpfi,3.78,3.15,9.50,114,5400,23,28,16845
201,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.10,...,141,mpfi,3.78,3.15,8.70,160,5300,19,25,19045
202,-1,95,volvo,gas,std,four,sedan,rwd,front,109.10,...,173,mpfi,3.58,2.87,8.80,134,5500,18,23,21485
203,-1,95,volvo,diesel,turbo,four,sedan,rwd,front,109.10,...,145,idi,3.01,3.40,23.00,106,4800,26,27,22470


### 5. Pandas Dataframe From Csv files

<li>We can load a CSV file and create a dataframe from its data using pandas.</li>
<li>The <b>read_csv()</b> function reads a CSV file and creates a pandas dataframe from the dataset.</li>

In [16]:
weather_df = pd.read_csv("data/weather_data.csv", names =["day","temperature","windspeed","event"])
weather_df.head()

Unnamed: 0,day,temperature,windspeed,event
0,kfjkdfjskd,,,
1,dfuhsdjufio,,,
2,day,temperature,windspeed,event
3,1/1/2017,32,6,Rain
4,1/4/2017,not available,9,Sunny


# Reading a csv file using skiprows and header parameters

In [17]:
weather_df = pd.read_csv("data/weather_data.csv", skiprows=2)
weather_df.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain


In [18]:
weather_df = pd.read_csv("data/weather_data.csv", header=2)
weather_df.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain
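
The two cells above give the same result because `skiprows=2` drops the first two lines before looking for a header, while `header=2` says that the line at position 2 (counting from 0) is the header and discards everything before it. The sketch below reproduces this on an in-memory string so it runs without the data file; the junk lines and sample rows are assumptions made for illustration.

```python
import io

import pandas as pd

# Assumed file layout: two junk lines, then the real header, then data.
raw_text = (
    "junk line one\n"
    "junk line two\n"
    "day,temperature,windspeed,event\n"
    "1/1/2017,32,6,Rain\n"
    "1/4/2017,not available,9,Sunny\n"
)

# Skip the first two lines, then treat the next line as the header.
skipped = pd.read_csv(io.StringIO(raw_text), skiprows=2)

# Treat line 2 (0-based) as the header; lines before it are ignored.
headed = pd.read_csv(io.StringIO(raw_text), header=2)

print(skipped.equals(headed))
```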


# Reading a csv file without header and giving names to the columns

In [19]:
weather_df_header = pd.read_csv("data/weather_data.csv", skiprows=3, header=None , names=['date',"temperature","windspeed","event"])
weather_df_header.head()

Unnamed: 0,date,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain


# Read limited data from a csv file using nrows parameters

In [20]:
weather_df_header = pd.read_csv("data/weather_data.csv", skiprows=3, header=None , names=['date',"temperature","windspeed","event"],nrows=8)
weather_df_header

Unnamed: 0,date,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain
5,1/8/2017,not available,not measured,Sunny
6,1/9/2017,not available,not measured,no event
7,1/10/2017,34,8,Cloudy


In [21]:
weather_df_header = pd.read_csv("data/weather_data.csv", skiprows=3, header=None , names=['date',"temperature","windspeed","event"],nrows=5)
weather_df_header

Unnamed: 0,date,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain


# Reading csv files with na_values parameters ('weather_data.csv' file)

In [22]:
weather_df_header = pd.read_csv("data/weather_data.csv", skiprows=3, header=None , names=['date',"temperature","windspeed","event"],nrows=5,na_values=["not available","not measured","no event"])
weather_df_header

Unnamed: 0,date,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
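
Notice that the numbers in the output above are now shown as 32.0 and 6.0. Once the placeholder strings are treated as missing, pandas can read the remaining values as floats. A small self-contained sketch of that effect, using made-up rows in place of the data file:

```python
import io

import pandas as pd

# Made-up sample that mimics the placeholders in weather_data.csv.
raw_text = (
    "day,temperature,windspeed,event\n"
    "1/1/2017,32,6,Rain\n"
    "1/5/2017,-1,not measured,Snow\n"
    "1/6/2017,not available,7,no event\n"
)

placeholders = ["not available", "not measured", "no event"]

# Without na_values the placeholders stay as text in their columns.
as_text = pd.read_csv(io.StringIO(raw_text))

# With na_values they become NaN and the numeric columns become floats.
as_numbers = pd.read_csv(io.StringIO(raw_text), na_values=placeholders)

missing_per_column = as_numbers.isna().sum()
print(as_numbers.dtypes)
print(missing_per_column)
```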


# Write a pandas dataframe to a csv file
1. We can write a pandas dataframe to a CSV file using the .to_csv() method.
2. You can give the CSV file any name when writing the dataframe.

In [23]:
weather_df_header.to_csv("weather_nan_data.csv", index=False)

In [24]:
nan_df = pd.read_csv("weather_nan_data.csv")
nan_df

Unnamed: 0,date,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
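
The `index=False` argument matters: by default `to_csv()` also writes the row index as a first column without a name, and reading the file back then produces a column called `Unnamed: 0`. The sketch below shows both cases using in-memory buffers instead of files; the sample dataframe is an assumption for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({"day": ["1/1/2017", "1/4/2017"], "temperature": [32.0, None]})

# Default behaviour: the index is written as an extra, unnamed column.
with_index_text = df.to_csv()

# index=False writes only the data columns.
without_index_text = df.to_csv(index=False)

reloaded_with_index = pd.read_csv(io.StringIO(with_index_text))
reloaded_without_index = pd.read_csv(io.StringIO(without_index_text))

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```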


### 6. Pandas Dataframe From Excel files
* We can load an Excel file with the .xlsx extension and create a dataframe from its data using pandas.
* The .read_excel() function reads an Excel file and creates a pandas dataframe from the dataset.
* We also need to install openpyxl to work with .xlsx files.

In [25]:
! pip install openpyxl






In [26]:
import pandas as pd

In [27]:
xl_df = pd.read_excel("data/weather_data.xlsx")

In [28]:
xl_df

Unnamed: 0.1,Unnamed: 0,day,temperature,windspeed,event
0,0,1/1/2017,32.0,6.0,Rain
1,1,1/4/2017,,9.0,Sunny
2,2,1/5/2017,-1.0,,Snow
3,3,1/6/2017,,7.0,
4,4,1/7/2017,32.0,,Rain
5,5,1/8/2017,,,Sunny
6,6,1/9/2017,,,
7,7,1/10/2017,34.0,8.0,Cloudy
8,8,1/11/2017,-4.0,,Snow
9,9,1/12/2017,26.0,12.0,Sunny


In [29]:
type(xl_df)

pandas.core.frame.DataFrame

In [30]:
xl_df.columns

Index(['Unnamed: 0', 'day', 'temperature', 'windspeed', 'event'], dtype='object')

In [31]:
xl_df.reset_index(drop=True, inplace=True)

In [32]:
xl_df

Unnamed: 0.1,Unnamed: 0,day,temperature,windspeed,event
0,0,1/1/2017,32.0,6.0,Rain
1,1,1/4/2017,,9.0,Sunny
2,2,1/5/2017,-1.0,,Snow
3,3,1/6/2017,,7.0,
4,4,1/7/2017,32.0,,Rain
5,5,1/8/2017,,,Sunny
6,6,1/9/2017,,,
7,7,1/10/2017,34.0,8.0,Cloudy
8,8,1/11/2017,-4.0,,Snow
9,9,1/12/2017,26.0,12.0,Sunny


In [33]:
# remove the leftover "Unnamed: 0" column (the index saved with the file)
xl_df.drop(columns=["Unnamed: 0"], inplace=True)

In [34]:
xl_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,,,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,-4.0,,Snow
9,1/12/2017,26.0,12.0,Sunny


# Writing to an excel file
* We can write a pandas dataframe to an Excel file using the .to_excel() method.

In [35]:
xl_df.to_excel("weather_nan_data.xlsx", index= False)

In [36]:
# reading weather_nan_data.xlsx
df = pd.read_excel("weather_nan_data.xlsx")
df.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain


#### Using head() and tail() method to see top 5 and last 5 rows
<li>To view the first few rows of our dataframe, we can use the DataFrame.head() method.</li>
<li>By default, it returns the first five rows of our dataframe.</li>
<li>However, it also accepts an optional integer parameter, which specifies the number of rows.</li>

<li>Similarly, to view the last few rows of our dataframe, we can use the DataFrame.tail() method.</li>
<li>By default, it returns the last five rows of our dataframe.</li>
<li>However, it also accepts an optional integer parameter, which specifies the number of rows.</li>
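
As a quick sketch of the behaviour described above, the following uses a small made-up dataframe with ten rows so the result is easy to check by eye:

```python
import pandas as pd

# Made-up dataframe with ten rows labelled 0 to 9.
df = pd.DataFrame({"value": range(10)})

default_head = df.head()   # first five rows
first_three = df.head(3)   # first three rows
last_two = df.tail(2)      # last two rows, original index labels kept

print(first_three)
print(last_two)
```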

#### Question:

<li>Use the head() method to select the first 6 rows.</li>
<li>Use the tail() method to select the last 8 rows.</li>

In [37]:
# Use the head() method to select the first 6 rows.
df = pd.read_excel("weather_nan_data.xlsx")
df.head(6)

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,,,Sunny


In [38]:
# Use the tail() method to select the last 8 rows.
df.tail(8)

Unnamed: 0,day,temperature,windspeed,event
5,1/8/2017,,,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,-4.0,,Snow
9,1/12/2017,26.0,12.0,Sunny
10,1/13/2017,12.0,12.0,Rainy
11,1/11/2017,-1.0,12.0,Snow
12,1/14/2017,40.0,,Sunny


#### Finding the column names from the dataframe
<li>We can use the df.columns attribute to check the column names of a pandas dataframe.</li>
<li>Similarly, we can use the df.values attribute to check the data stored in a pandas dataframe.</li>
<li>Exercise: check the type of the columns and of the values, the shape and the dimension of the dataframe, practise slicing, and display the raw 'no event', 'not measured', and 'not available' placeholders.</li>

In [39]:
# We have df.columns attributes to check the name of columns in the pandas dataframe.
df.columns

Index(['day', 'temperature', 'windspeed', 'event'], dtype='object')

In [40]:
# we have df.values attributes to check the data present in the pandas dataframe.
df.values

array([['1/1/2017', 32.0, 6.0, 'Rain'],
       ['1/4/2017', nan, 9.0, 'Sunny'],
       ['1/5/2017', -1.0, nan, 'Snow'],
       ['1/6/2017', nan, 7.0, nan],
       ['1/7/2017', 32.0, nan, 'Rain'],
       ['1/8/2017', nan, nan, 'Sunny'],
       ['1/9/2017', nan, nan, nan],
       ['1/10/2017', 34.0, 8.0, 'Cloudy'],
       ['1/11/2017', -4.0, nan, 'Snow'],
       ['1/12/2017', 26.0, 12.0, 'Sunny'],
       ['1/13/2017', 12.0, 12.0, 'Rainy'],
       ['1/11/2017', -1.0, 12.0, 'Snow'],
       ['1/14/2017', 40.0, nan, 'Sunny']], dtype=object)

In [41]:
# Check columns type , slicing, values, type of values,shape, dimension, print no event,not measured,not available
print("Checking column type:")
print(f"Type of column is {type(df.columns)}")
print("_______________________________________")
print("Checking the type of  values:")
print(f"Type of values is {type(df.values)}")
print("_______________________________________")
print("Checking the shape of  DataFrame:")
print(f"The shape of DataFrame:{df.shape}")
print("_______________________________________")
print("Checking the dimension of  DataFrame:")
print(f"The dimension of DataFrame:{df.ndim}")
print("_______________________________________")

Checking column type:
Type of column is <class 'pandas.core.indexes.base.Index'>
_______________________________________
Checking the type of  values:
Type of values is <class 'numpy.ndarray'>
_______________________________________
Checking the shape of  DataFrame:
The shape of DataFrame:(13, 4)
_______________________________________
Checking the dimension of  DataFrame:
The dimension of DataFrame:2
_______________________________________


In [42]:
# slicing
df.loc[:,["day","temperature"]]

Unnamed: 0,day,temperature
0,1/1/2017,32.0
1,1/4/2017,
2,1/5/2017,-1.0
3,1/6/2017,
4,1/7/2017,32.0
5,1/8/2017,
6,1/9/2017,
7,1/10/2017,34.0
8,1/11/2017,-4.0
9,1/12/2017,26.0


In [43]:
df.loc[:,"day":"event"]

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,,,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,-4.0,,Snow
9,1/12/2017,26.0,12.0,Sunny


In [44]:
df.iloc[:,0:3]

Unnamed: 0,day,temperature,windspeed
0,1/1/2017,32.0,6.0
1,1/4/2017,,9.0
2,1/5/2017,-1.0,
3,1/6/2017,,7.0
4,1/7/2017,32.0,
5,1/8/2017,,
6,1/9/2017,,
7,1/10/2017,34.0,8.0
8,1/11/2017,-4.0,
9,1/12/2017,26.0,12.0


In [45]:
# reload the raw CSV so the 'no event', 'not measured' and 'not available' placeholders are visible
df = pd.read_csv("data/weather_data.csv", skiprows=2)
df.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain


#### Checking the type of your dataframe 
<li>Another feature that makes pandas better for working with data is that dataframes can contain more than one data type.</li>
<li>Axis values can have string labels, not just numeric ones.</li>
<li>Dataframes can contain columns with multiple data types: including integer, float, and string.</li>
<li>We can use the DataFrame.dtypes attribute (similar to NumPy) to return information about the types of each column.</li>
<li>When we import data, pandas attempts to guess the correct dtype for each column.</li>
<li>Generally, pandas does well with this, which means we don't need to worry about specifying dtypes every time we start to work with data.</li>



In [46]:
df.dtypes

day            object
temperature    object
windspeed      object
event          object
dtype: object
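
Every column above is reported as `object` because the placeholder strings force pandas to keep the numeric columns as text. One possible way to recover numbers after loading is `pd.to_numeric` with `errors="coerce"`, which turns anything that cannot be parsed into NaN. This is a sketch on made-up values; the column name `temperature_num` is a hypothetical name chosen for the example.

```python
import pandas as pd

# Made-up column that mixes numeric text with a placeholder string.
df = pd.DataFrame({"temperature": ["32", "not available", "-1"]})

# Unparseable entries become NaN instead of raising an error.
df["temperature_num"] = pd.to_numeric(df["temperature"], errors="coerce")

print(df.dtypes)
print(df)
```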

#### Datatypes Information
<li>We can get the shape of the dataset using the <b>.shape</b> attribute (it is not a method, so it takes no parentheses).</li>
<li><b>.shape</b> returns a tuple containing the number of rows and the number of columns in the dataset.</li>
<li>If we wanted an overview of all the dtypes used in our dataframe, we can use <b>.info()</b> method.</li>
<li>Note that <b>DataFrame.info()</b> prints the information, rather than returning it, so we can't assign it to a variable.</li>
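
A short sketch of both points, on a made-up dataframe: `shape` is read as an attribute and can be unpacked into two numbers, while `info()` prints its summary and returns `None`.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# shape is an attribute holding (rows, columns).
n_rows, n_cols = df.shape

# info() prints the summary; the value it returns is None.
info_result = df.info()

print(n_rows, n_cols)
print(info_result)
```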


In [47]:
# .shape
df.shape

(13, 4)

In [48]:
# .info()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   day          13 non-null     object
 1   temperature  13 non-null     object
 2   windspeed    13 non-null     object
 3   event        13 non-null     object
dtypes: object(4)
memory usage: 544.0+ bytes


In [49]:
# reading nan weather data
n_df = pd.read_csv("weather_nan_data.csv")
n_df

Unnamed: 0,date,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain


In [50]:
# shape of nan weather data
n_df.shape

(5, 4)

In [51]:
# .info()
n_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         5 non-null      object 
 1   temperature  3 non-null      float64
 2   windspeed    3 non-null      float64
 3   event        4 non-null      object 
dtypes: float64(2), object(2)
memory usage: 288.0+ bytes


#### Checking the null values in the pandas dataframe

In [52]:
# checking null values using .isnull()
n_df.isnull()

Unnamed: 0,date,temperature,windspeed,event
0,False,False,False,False
1,False,True,False,False
2,False,False,True,False
3,False,True,False,True
4,False,False,True,False


In [53]:
# checking null values using .isna()
n_df.isna()


Unnamed: 0,date,temperature,windspeed,event
0,False,False,False,False
1,False,True,False,False
2,False,False,True,False
3,False,True,False,True
4,False,False,True,False


In [54]:
# find the number of null values in DataFrame
n_df.isna().sum()

date           0
temperature    2
windspeed      2
event          1
dtype: int64

Here, the temperature and windspeed columns each have two missing values, event has one missing value, and date has none.

# set_index() and reset_index() method

In [55]:
# set_index()
n_df.set_index(keys=['date','event'])

Unnamed: 0_level_0,Unnamed: 1_level_0,temperature,windspeed
date,event,Unnamed: 2_level_1,Unnamed: 3_level_1
1/1/2017,Rain,32.0,6.0
1/4/2017,Sunny,,9.0
1/5/2017,Snow,-1.0,
1/6/2017,,,7.0
1/7/2017,Rain,32.0,


Here, the set_index() method sets the index from the columns named in the keys parameter. Every key must be present in DataFrame.columns, otherwise pandas raises a KeyError, as the next cell shows.

In [56]:
n_df.set_index(keys='index') 

KeyError: "None of ['index'] are in the columns"

In [57]:
# reset_index()
n_df


Unnamed: 0,date,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain


In [58]:
n_df.set_index(keys='date',inplace=True)
n_df

Unnamed: 0_level_0,temperature,windspeed,event
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1/1/2017,32.0,6.0,Rain
1/4/2017,,9.0,Sunny
1/5/2017,-1.0,,Snow
1/6/2017,,7.0,
1/7/2017,32.0,,Rain


In [59]:
# resetting the index
n_df.reset_index(drop=True, inplace=True)
n_df

Unnamed: 0,temperature,windspeed,event
0,32.0,6.0,Rain
1,,9.0,Sunny
2,-1.0,,Snow
3,,7.0,
4,32.0,,Rain


#### Selecting a column from a pandas DataFrame

<li>Since the axes in pandas have labels, we can select data using those labels.</li> 
<li>Unlike in NumPy, we do not need to know the exact index position of the data we want.</li>
<li>To do this, we can use the DataFrame.loc[] attribute. The syntax for DataFrame.loc[] is:</li>
<code>
df.loc[row_label, column_label]
</code>

<li>We can use the following shortcut to select a single column:</li>
<code>
df["column_name"]
</code>

<li>This style of selecting columns is very common.</li>
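
The two forms can be compared directly. This sketch uses a made-up dataframe with string row labels so that the row label and the column label are easy to tell apart:

```python
import pandas as pd

# Made-up data with string row labels "a", "b", "c".
df = pd.DataFrame(
    {"day": ["1/1/2017", "1/4/2017", "1/5/2017"], "event": ["Rain", "Sunny", "Snow"]},
    index=["a", "b", "c"],
)

one_value = df.loc["b", "event"]     # single cell: row label, column label
column_via_loc = df.loc[:, "event"]  # all rows of one column
column_via_shortcut = df["event"]    # shortcut for the same column

print(one_value)
print(column_via_loc.equals(column_via_shortcut))
```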


In [60]:
weather_df = pd.read_csv("data/weather_data.csv", names=['day','temperature','windspeed','event'])
weather_df

Unnamed: 0,day,temperature,windspeed,event
0,kfjkdfjskd,,,
1,dfuhsdjufio,,,
2,day,temperature,windspeed,event
3,1/1/2017,32,6,Rain
4,1/4/2017,not available,9,Sunny
5,1/5/2017,-1,not measured,Snow
6,1/6/2017,not available,7,no event
7,1/7/2017,32,not measured,Rain
8,1/8/2017,not available,not measured,Sunny
9,1/9/2017,not available,not measured,no event


In [61]:
weather_df.dropna(inplace=True)

In [62]:
weather_df.drop(index=2, inplace=True)
weather_df

Unnamed: 0,day,temperature,windspeed,event
3,1/1/2017,32,6,Rain
4,1/4/2017,not available,9,Sunny
5,1/5/2017,-1,not measured,Snow
6,1/6/2017,not available,7,no event
7,1/7/2017,32,not measured,Rain
8,1/8/2017,not available,not measured,Sunny
9,1/9/2017,not available,not measured,no event
10,1/10/2017,34,8,Cloudy
11,1/11/2017,-4,-1,Snow
12,1/12/2017,26,12,Sunny


In [63]:
weather_df.reset_index(drop=True, inplace=True)
weather_df.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain


In [64]:
weather_df.loc[:,"event"]

0         Rain
1        Sunny
2         Snow
3     no event
4         Rain
5        Sunny
6     no event
7       Cloudy
8         Snow
9        Sunny
10       Rainy
11        Snow
12       Sunny
Name: event, dtype: object

In [65]:
weather_df["event"]

0         Rain
1        Sunny
2         Snow
3     no event
4         Rain
5        Sunny
6     no event
7       Cloudy
8         Snow
9        Sunny
10       Rainy
11        Snow
12       Sunny
Name: event, dtype: object

In [66]:
weather_df.iloc[:,3:]

Unnamed: 0,event
0,Rain
1,Sunny
2,Snow
3,no event
4,Rain
5,Sunny
6,no event
7,Cloudy
8,Snow
9,Sunny


#### Questions

<li>Read <b>'appointment_schedule.csv'</b> file using pandas.</li>
<li>Select the <b>'name'</b> column from the given dataset and store to <b>'appointment_names'</b> variable.</li>
<li>Use Python's <b>type()</b> function to assign the type of name column to <b>name_type</b>.</li>

In [67]:
# Read 'appointment_schedule.csv' file using pandas.
schedule_df = pd.read_csv("data/appointment_schedule.csv")
schedule_df.head()

Unnamed: 0,name,appointment_made_date,app_start_date,app_end_date,visitee_namelast,visitee_namefirst,meeting_room,description
0,Joshua T. Blanton,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
1,Jack T. Gutting,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
2,Bradley T. Guiles,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
3,Loryn F. Grieb,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
4,Travis D. Gordon,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard


In [68]:
# Select the 'name' column from the given dataset and store to 'appointment_names' variable.
appointment_names = schedule_df["name"]
appointment_names

0        Joshua T. Blanton
1          Jack T. Gutting
2        Bradley T. Guiles
3           Loryn F. Grieb
4         Travis D. Gordon
              ...         
580         Ryan J. Morgan
581    Alexander V. Nevsky
582     Montana J. Johnson
583    Joseph A. Pritchard
584        Martin O. Reina
Name: name, Length: 585, dtype: object

In [69]:
# Use Python's type() function to assign the type of name column to name_type.
name_type = type(appointment_names)
print(f"The type of schedule name : {name_type}")

The type of schedule name : <class 'pandas.core.series.Series'>


#### Pandas Series
<li>Series is the pandas type for one-dimensional objects.</li>
<li>Anytime you see a 1D pandas object, it will be a series. Anytime you see a 2D pandas object, it will be a dataframe.</li>
<li>A dataframe is a collection of series objects, which is similar to how pandas stores the data behind the scenes.</li>
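
A small sketch of the 1D versus 2D rule on a made-up dataframe: single brackets return a series, double brackets return a dataframe, and a single row is also a series.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Hari", "amisha"], "age": [21, 20]})

one_column = df["name"]     # 1D: Series
as_frame = df[["name"]]     # 2D: DataFrame with a single column
one_row = df.loc[0]         # 1D: Series holding the first row

print(type(one_column).__name__, one_column.ndim)
print(type(as_frame).__name__, as_frame.ndim)
print(type(one_row).__name__, one_row.ndim)
```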

# Adding a column in a pandas dataframe

In [70]:
n_df

Unnamed: 0,temperature,windspeed,event
0,32.0,6.0,Rain
1,,9.0,Sunny
2,-1.0,,Snow
3,,7.0,
4,32.0,,Rain


In [71]:
import numpy as np
n_df.insert(loc=0, column='id', value=np.array([1, 2, 3, 4, 5]))
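
`insert()` places the new column at a chosen position and modifies the dataframe in place. For comparison, plain assignment appends the column at the end. A sketch on a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"temperature": [32.0, -1.0], "event": ["Rain", "Snow"]})

# Assignment appends the new column after the existing ones.
df["windspeed"] = [6.0, 9.0]

# insert() puts the new column at position 0 and changes df in place.
df.insert(loc=0, column="id", value=[1, 2])

print(list(df.columns))
```

Calling `insert()` a second time with the same column name raises a `ValueError` by default, so re-running such a notebook cell fails unless the dataframe is recreated first.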

### Selecting Multiple Columns From the DataFrame

![](images/selecting_columns.png)

<li>We can select multiple columns from the dataframe by using the following codes:</li>
<code>
    df.loc[:, ["col1", "col2"]]
</code>

<li>We can use syntax shortcuts for selecting multiple columns by using the following syntax:</li>
<code>
    df[["col1", "col2"]]
</code>

In [72]:
car_detail = pd.read_csv("data/car_details.csv")
car_detail.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner


In [73]:
car_detail.loc[:,["fuel","owner"]]

Unnamed: 0,fuel,owner
0,Petrol,First Owner
1,Petrol,First Owner
2,Diesel,First Owner
3,Petrol,First Owner
4,Diesel,Second Owner
...,...,...
4335,Diesel,Second Owner
4336,Diesel,Second Owner
4337,Petrol,Second Owner
4338,Diesel,First Owner


In [74]:
car_detail[['fuel','owner']]

Unnamed: 0,fuel,owner
0,Petrol,First Owner
1,Petrol,First Owner
2,Diesel,First Owner
3,Petrol,First Owner
4,Diesel,Second Owner
...,...,...
4335,Diesel,Second Owner
4336,Diesel,Second Owner
4337,Petrol,Second Owner
4338,Diesel,First Owner


#### Selecting Rows From A Pandas DataFrame

<li>Now that we've learned how to select columns by label, let's learn how to select rows using the labels of the index axis.</li>
<li>We can use the same syntax to select rows from a dataframe as we do for columns:</li>
<code>
    df.loc[row_label, column_label]
</code>

#### Indexing & Slicing In Pandas DataFrame

<li>We can slice a dataset from their rows as well as columns.</li>
<li>If we have (5,5) shape data and we want first three rows and first three columns then we need to slice both rows and columns to get a desired shape.</li>
<li>We can use the df.iloc[] indexer, which takes integer positions in square brackets, for indexing as well as slicing in a dataframe.</li>
<li>Let's practice using .iloc[].</li>
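
The (5,5) case described above can be sketched as follows. The dataframe is made up with `np.arange`, and the column names `a` to `e` are assumptions for the example. The label-based `loc` version is included for comparison, because its slices include the end label while `iloc` slices exclude the end position.

```python
import numpy as np
import pandas as pd

# Made-up 5x5 dataframe holding the numbers 0..24.
df = pd.DataFrame(np.arange(25).reshape(5, 5), columns=list("abcde"))

# Positions: rows 0-2 and columns 0-2 (the end position 3 is excluded).
top_left = df.iloc[:3, :3]

# Labels: rows 0-2 and columns "a" to "c" (the end labels are included).
by_label = df.loc[0:2, "a":"c"]

print(top_left)
print(top_left.equals(by_label))
```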


In [75]:
car_detail.iloc[:,1:4]

Unnamed: 0,year,selling_price,km_driven
0,2007,60000,70000
1,2007,135000,50000
2,2012,600000,100000
3,2017,250000,46000
4,2014,450000,141000
...,...,...,...
4335,2014,409999,80000
4336,2014,409999,80000
4337,2009,110000,83000
4338,2016,865000,90000


#### Datatype Conversion In Pandas

<li>astype() is one of the most important pandas methods. It is used to change the data type of a series.</li>
<li>When a pandas dataframe is created from a CSV file, the data type of each column is set automatically.</li>
<li>Sometimes the inferred datatype is not what it should be, and this is where we can use astype() to get the desired datatype.</li>
<li>For example, a salary column could be imported as string but to do operations we have to convert it into float.</li>
<li>astype() is used to do such data type conversions.</li>
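
The salary example above can be sketched like this, with made-up values stored as strings:

```python
import pandas as pd

# Made-up salaries that arrived as text.
df = pd.DataFrame({"salary": ["1000.5", "2500", "300.25"]})

# Convert the text to floats so arithmetic works as expected.
df["salary"] = df["salary"].astype(float)

total = df["salary"].sum()
print(df["salary"].dtype)
print(total)
```

Unlike `pd.to_numeric(..., errors="coerce")`, `astype(float)` raises an error if any value cannot be parsed, so it suits columns that are already clean.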

In [76]:
car_detail["selling_price"].dtype

dtype('int64')

In [77]:
selling_price = car_detail["selling_price"]
s_p = selling_price.astype(dtype=float)

#### Value Counts Method

<li>Since series and dataframes are two distinct objects, they have their own unique methods.</li>

<li>Let's look at an example of a series method - the Series.value_counts() method.</li>

<li>This method returns each unique non-null value in a column together with its count, sorted from the most frequent to the least frequent.</li>

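<li>In pandas versions before 1.1, value_counts() was a series-only method, and calling it on a dataframe raised the error below. Since pandas 1.1, DataFrame.value_counts() also exists and counts unique rows.</li>

<code>
    AttributeError: 'DataFrame' object has no attribute 'value_counts'
</code>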

# Creating a frequency table from value_counts

In [78]:
s_p.value_counts()

selling_price
300000.0     162
250000.0     125
350000.0     122
550000.0     107
600000.0     103
            ... 
2100000.0      1
828999.0       1
1119000.0      1
746000.0       1
865000.0       1
Name: count, Length: 445, dtype: int64
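
A frequency table often needs proportions as well as raw counts. As a sketch on a made-up series, `normalize=True` returns shares instead of counts, and `dropna=False` keeps missing values in the table:

```python
import pandas as pd

# Made-up fuel column with one missing entry.
fuel = pd.Series(["Petrol", "Diesel", "Petrol", "Petrol", None])

counts = fuel.value_counts()                  # missing values are excluded
shares = fuel.value_counts(normalize=True)    # fractions of the non-missing values
with_missing = fuel.value_counts(dropna=False)  # missing values get their own row

print(counts)
print(shares)
print(with_missing)
```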

# Renaming the column names in a pandas dataframe

In [79]:
car_detail.rename(columns ={"selling_price":"s_p"})

Unnamed: 0,name,year,s_p,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner
...,...,...,...,...,...,...,...,...
4335,Hyundai i20 Magna 1.4 CRDi (Diesel),2014,409999,80000,Diesel,Individual,Manual,Second Owner
4336,Hyundai i20 Magna 1.4 CRDi,2014,409999,80000,Diesel,Individual,Manual,Second Owner
4337,Maruti 800 AC BSIII,2009,110000,83000,Petrol,Individual,Manual,Second Owner
4338,Hyundai Creta 1.6 CRDi SX Option,2016,865000,90000,Diesel,Individual,Manual,First Owner
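
`rename()` returns a new dataframe and leaves the original untouched, so the cell above only displays the renamed copy. To keep the new name, assign the result to a variable. A sketch on a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"selling_price": [60000], "km_driven": [70000]})

# The mapping is {old_name: new_name}; columns not listed keep their names.
renamed = df.rename(columns={"selling_price": "s_p"})

print(list(df.columns))       # original is unchanged
print(list(renamed.columns))
```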


#### Selecting Items From A Series

<li>As with dataframes, we can use Series.loc[] to select items from a series using single labels, a list, or a slice object.</li>
<li>We can also omit loc[] and use bracket shortcuts for all three:</li>
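
The three selection styles can be sketched on a made-up series with string labels, which makes it clear that the selection is by label and not by position:

```python
import pandas as pd

# Made-up prices with string labels.
prices = pd.Series([60000.0, 135000.0, 600000.0], index=["a", "b", "c"])

single = prices.loc["b"]           # one label -> a single value
several = prices.loc[["a", "c"]]   # list of labels -> a series
sliced = prices.loc["a":"b"]       # label slice -> end label is included

# The bracket shortcut gives the same results.
same_single = prices["b"]
same_several = prices[["a", "c"]]

print(single)
print(several)
print(sliced)
```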

In [80]:
s_p.loc[1:5]

1    135000.0
2    600000.0
3    250000.0
4    450000.0
5    140000.0
Name: selling_price, dtype: float64

In [81]:
s_p.iloc[:5]

0     60000.0
1    135000.0
2    600000.0
3    250000.0
4    450000.0
Name: selling_price, dtype: float64

#### Question

<li>Use the value counts method to check the frequency count of different names from 'appointment_schedule.csv' file.</li>
<li>Select only first row from the series.</li>
<li>Select the first row and the last row from the series.</li>
<li>Select the first five rows and the last five rows from the series.</li>



In [82]:
schedule_df = pd.read_csv("data/appointment_schedule.csv")
schedule_df.head()

Unnamed: 0,name,appointment_made_date,app_start_date,app_end_date,visitee_namelast,visitee_namefirst,meeting_room,description
0,Joshua T. Blanton,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
1,Jack T. Gutting,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
2,Bradley T. Guiles,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
3,Loryn F. Grieb,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
4,Travis D. Gordon,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard


# Use the value counts method to check the frequency count of different names from 'appointment_schedule.csv' file.

In [83]:
schedule_df['name'].describe(include="object")

count                    585
unique                   542
top       Jesus MurilloKaram
freq                       3
Name: name, dtype: object

In [84]:
schedule_df['name'].value_counts()

name
Jesus MurilloKaram            3
Michael A. Marr               2
JoseAntonio MeadeKuribrena    2
Todd S. Mizis                 2
Kieffer T. Elkins             2
                             ..
Anthony J. Falsone            1
Robert C. Buford              1
Philip Coles                  1
Kristopher L. Davis           1
Joseph A. Pritchard           1
Name: count, Length: 542, dtype: int64

# Select only first row from the series.

In [85]:
schedule_df['name'].loc[0]

'Joshua T. Blanton'

# Select the first row and the last row from the series.

In [86]:
schedule_df['name'].iloc[[0, -1]]

0      Joshua T. Blanton
584      Martin O. Reina
Name: name, dtype: object

# Select the first five rows and the last five rows from the series.

In [87]:
# first_five
schedule_df['name'].head(5)

0    Joshua T. Blanton
1      Jack T. Gutting
2    Bradley T. Guiles
3       Loryn F. Grieb
4     Travis D. Gordon
Name: name, dtype: object

In [88]:
# last_five
schedule_df['name'].tail(5)

580         Ryan J. Morgan
581    Alexander V. Nevsky
582     Montana J. Johnson
583    Joseph A. Pritchard
584        Martin O. Reina
Name: name, dtype: object

In [89]:
# another approach: loop over the first five and the last five names
names = schedule_df['name']
for name in pd.concat([names.head(5), names.tail(5)]):
    print(name)


Joshua T. Blanton
Jack T. Gutting
Bradley T. Guiles
Loryn F. Grieb
Travis D. Gordon
Ryan J. Morgan
Alexander V. Nevsky
Montana J. Johnson
Joseph A. Pritchard
Martin O. Reina


#### DataFrame Vs DataSeries
#### Vectorized Operations In Pandas

<li>We'll explore how pandas uses many of the concepts we learned in NumPy.</li>
<li>Because pandas is designed to operate like NumPy, a lot of concepts and methods from NumPy are supported.</li>
<li>Recall that one of the ways NumPy makes working with data easier is with vectorized operations.</li>
<li>Just like with NumPy, we can use any of the standard Python numeric operators with series, including:</li>
<code>
    series_a + series_b - Addition
    series_a - series_b - Subtraction
    series_a * series_b - Multiplication
    series_a / series_b - Division
</code>
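
A minimal sketch of these operators. The two series below hold made-up values and are not taken from the notebook's data:

```python
import pandas as pd

# Two hypothetical series that share the same index
series_a = pd.Series([10, 20, 30])
series_b = pd.Series([1, 2, 3])

# Each operator is applied element by element, without an explicit loop
added = series_a + series_b        # 11, 22, 33
subtracted = series_a - series_b   # 9, 18, 27
multiplied = series_a * series_b   # 10, 40, 90
divided = series_a / series_b      # 10.0, 10.0, 10.0

print(added.tolist())
```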

#### Some Statistical Functions In Pandas

<li>Like NumPy, Pandas supports many descriptive stats methods such as mean, median, mode, min, max and so on.</li>
<li>Here are a few of the most useful ones.</li>
<code>
Series.max()
Series.min()
Series.mean()
Series.median()
Series.mode()
Series.sum()
</code>
<li>We can calculate the average value of a particular column(series) using df.column_name.mean().</li>
<li>For calculating the minimum value in a particular column(series), we can use df.column_name.min().</li>
<li>Similarly, for calculating the maximum value in a particular column(series), we can use df.column_name.max().</li>
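
The methods listed above can be tried on a small example. This is only a sketch: the dataframe and its `selling_price` values are made up for illustration.

```python
import pandas as pd

# Hypothetical prices, not taken from the notebook's dataset
df = pd.DataFrame({"selling_price": [100, 200, 200, 500]})

highest = df["selling_price"].max()       # 500
lowest = df["selling_price"].min()        # 100
average = df["selling_price"].mean()      # 250.0
middle = df["selling_price"].median()     # 200.0
total = df["selling_price"].sum()         # 1000

# mode() returns a Series because several values can tie for most frequent
most_common = df["selling_price"].mode()  # contains 200
```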

#### Finding the descriptive statistics of the dataframe using .describe() method

<li>Descriptive statistics summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.</li>
<li>By default, the describe() method in Pandas computes descriptive statistics for all numeric columns of a dataframe.</li>
<li>It can analyze both numeric and object series, as well as DataFrame column sets of mixed data types.</li>
<li>The output will vary depending on what is provided. If a dataframe has no numeric columns, as with schedule_df, describe() summarizes the object columns with count, unique, top and freq.</li>
<li>If we want to see the descriptive statistics of the object columns, we have to specify <b>df.describe(include = "O")</b></li>

In [90]:
schedule_df.describe()

Unnamed: 0,name,appointment_made_date,app_start_date,app_end_date,visitee_namelast,visitee_namefirst,meeting_room,description
count,585,585,585,585,56,585,585,213
unique,542,11,23,9,5,6,13,9
top,Jesus MurilloKaram,2015-01-09T00:00:00,1/12/15 13:00,1/12/15 23:59,/,POTUS,State Floo,JointService Military Honor Guard
freq,3,247,217,286,36,376,279,95


In [91]:
schedule_df.describe(include = "O")

Unnamed: 0,name,appointment_made_date,app_start_date,app_end_date,visitee_namelast,visitee_namefirst,meeting_room,description
count,585,585,585,585,56,585,585,213
unique,542,11,23,9,5,6,13,9
top,Jesus MurilloKaram,2015-01-09T00:00:00,1/12/15 13:00,1/12/15 23:59,/,POTUS,State Floo,JointService Military Honor Guard
freq,3,247,217,286,36,376,279,95


#### Assigning Values With Pandas

<li>Just like in NumPy, the same techniques that we use to select data could be used for assignment.</li>

<li>When we selected a whole column by label and used assignment, we assigned the value to every item in that column.</li>

<li>By providing labels for both axes, we can assign a value to a single cell within our dataframe.</li>

<code>
    df.loc[row_label, col_label] = assignment_value
</code>
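
A short sketch of both kinds of assignment. The `weather` dataframe is hypothetical and only loosely resembles the weather data used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical weather table with one missing temperature
weather = pd.DataFrame(
    {"temperature": [32.0, np.nan, -1.0], "event": ["Rain", "Sunny", "Snow"]}
)

# Row label 1 and column label "temperature" identify exactly one cell
weather.loc[1, "temperature"] = 25.0

# Assigning a scalar to a whole column sets every row in that column
weather["event"] = "Unknown"
```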

In [92]:
n_df

Unnamed: 0,id,temperature,windspeed,event
0,1,32.0,6.0,Rain
1,2,,9.0,Sunny
2,3,-1.0,,Snow
3,4,,7.0,
4,5,32.0,,Rain


In [99]:
n_df["id"]=[10,11,12,13,14]
n_df

Unnamed: 0,id,temperature,windspeed,event
0,10,32.0,6.0,Rain
1,11,,9.0,Sunny
2,12,-1.0,,Snow
3,13,,7.0,
4,14,32.0,,Rain


#### Using Boolean Indexing With Pandas Objects (Selection With Condition In Pandas)
<li>We can assign a value by using a row label and a column label in pandas.</li>
<li>But what if we need to assign the same value to a group of rows that meet the same criteria?</li>
<li>Instead of assigning to each row individually, we can use boolean indexing to change all rows that meet the same criteria, just like we did with NumPy.</li>


<ol>
    <li>Equals: df['series'] == value</li>
    <li>Not Equals: df['series'] != value</li>
    <li>Less than: df['series'] < value</li>
    <li>Less than or equal to: df['series'] <= value</li>
    <li>Greater than: df['series'] > value</li>
    <li>Greater than or equal to: df['series'] >= value</li>
</ol>
<li>These conditions can be used in several ways, most commonly inside .loc to select values with conditions.</li>

In [100]:
n_df['id'] == 11

0    False
1     True
2    False
3    False
4    False
Name: id, dtype: bool

In [101]:
n_df.loc[n_df['id'] == 11]

Unnamed: 0,id,temperature,windspeed,event
1,11,,9.0,Sunny


In [102]:
n_df.loc[n_df['id'] < 13]

Unnamed: 0,id,temperature,windspeed,event
0,10,32.0,6.0,Rain
1,11,,9.0,Sunny
2,12,-1.0,,Snow
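
Boolean masks can also pick the rows to change in an assignment, which is the case described at the start of this section. A minimal sketch, assuming a small made-up dataframe (the column names mirror n_df, the values do not):

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame(
    {"id": [10, 11, 12, 13], "event": ["Rain", "Sunny", "Snow", "Rain"]}
)

# Boolean mask: True for every row whose event is "Rain"
is_rain = df["event"] == "Rain"

# Passing the mask and a column label to .loc assigns to all matching rows at once
df.loc[is_rain, "event"] = "Rainy"
```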


### Using Pandas Methods To Create a Boolean Mask

<li>In the last couple of lessons, we used Python boolean operators to create boolean masks to select subsets of data.</li>

<li>There are also a number of pandas methods that return boolean masks useful for exploring data.</li>

<li>Two examples are the Series.isnull() method and the Series.notnull() method.</li>
<li>The Series.isnull() method can be used to select rows that contain null (or NaN) values for a certain column.</li>
<li>Similarly, the Series.notnull() method is used to select rows that do not contain null values for a certain column.</li>

In [103]:
n_df['temperature'].isnull()

0    False
1     True
2    False
3     True
4    False
Name: temperature, dtype: bool

In [104]:
n_df['temperature'].isnull().sum()

2
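
The masks returned by these methods can be passed straight back to the dataframe to select rows. A small sketch on made-up data shaped like n_df:

```python
import numpy as np
import pandas as pd

# Hypothetical data; two of the four temperatures are missing
df = pd.DataFrame({"id": [1, 2, 3, 4], "temperature": [32.0, np.nan, -1.0, np.nan]})

# Rows where temperature is missing
missing_rows = df[df["temperature"].isnull()]

# Rows where temperature is present
present_rows = df[df["temperature"].notnull()]
```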

#### Question 1

<li>Read 'Fortune_1000.csv' file using pandas read_csv() method and store it in a variable named f1000.</li>
<li>Select the rank, revenues, and rank_change columns in f1000. Then, use the df.head() method to select first five rows.</li>
<li>Select just the fifth row of the f1000 dataframe. Assign the result to fifth_row using iloc.</li>
<li>Select the value in first row of the company column. Assign the result to company_value.</li>
<li>Select the last three rows of the f1000 dataframe. Assign the result to last_three_rows.</li>
<li>Select the first to seventh rows and the first five columns of the f1000 dataframe.</li>



#### Question 2
<li>Use the Series.isnull() method to select all rows from f1000 that have a null value for the previous_rank column.</li>
<li>Select only the company, rank, and previous_rank columns where the previous_rank column is null.</li>
<li>Use the Series.notnull() method to select all rows from f1000 that have a non-null value for the previous_rank column. Assign the result to previously_ranked.</li>
<li>From the previously_ranked dataframe, subtract the rank column from the previous_rank column. Assign the result to rank_change.</li>
<li>Assign the values in rank_change to a new column in the f1000 dataframe, "rank_change".</li>

#### Question 3
<li>Select all companies with revenues over 100 thousand and negative profits from the f1000 dataframe.</li>

##### Instructions

<li>Create a boolean array that selects the companies with revenues greater than 100 thousand.</li>
<li>Create a boolean array that selects the companies with profits less than 0.</li>


#### Question 4
<li>Select all rows for companies whose city value is either Brazil or Venezuela.</li>
<li>Select the first five companies in the Technology sector for which the city is not "Boston" from the f1000 dataframe.</li>

#### Sorting Values
<li>We can use the DataFrame.sort_values() method to sort the rows on a particular column.</li>
<li>To do so, we pass the column name to the method:</li>
<code>
sorted_rows = df.sort_values("column_name")
</code>
<li>By default, the sort_values() method will sort the rows in ascending order — from smallest to largest.</li>
<li>To sort the rows in descending order instead, we can set the ascending parameter to False:</li>
<code>
    sorted_rows = df.sort_values("column_name", ascending=False)
</code>
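
A minimal sketch of both sort orders. The company names and employee counts are made up:

```python
import pandas as pd

# Hypothetical companies and employee counts
df = pd.DataFrame({"company": ["A", "B", "C"], "employees": [500, 1200, 300]})

ascending_rows = df.sort_values("employees")                    # 300, 500, 1200
descending_rows = df.sort_values("employees", ascending=False)  # 1200, 500, 300

# The original row labels travel with the rows, so iloc[0] is the safe way
# to take the first row after sorting
largest = descending_rows.iloc[0]
```

Because sorting keeps the original index labels, `descending_rows.loc[0]` would return company "A" here, not the row that is now on top.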


#### Question
<li>Read 'Fortune_1000.csv' using pandas read_csv() method.</li>
<li>Find the company headquartered in Los Angeles with the largest number of employees.</li>
<li>Select only the rows that have a city name equal to Los Angeles.</li>
<li>Use DataFrame.sort_values() to sort those rows by the employees column in descending order.</li>
<li>Use DataFrame.iloc[] to select the first row from the sorted dataframe.</li>


### String Manipulation In Pandas DataFrame

<li>String manipulation is the process of changing, parsing, splitting, cleaning or analyzing strings.</li>
<li>Raw string data is often not in a form that is suitable for analysis or for describing the data.</li>
<li>Python is known for its ability to manipulate strings.</li>
<li>Pandas provides built-in functions, available through the .str accessor of a series, to modify and process the string data in a dataframe.</li>
<li>Some of the most useful pandas string processing functions are as follows:</li>
<ol>
    <li><b>lower()</b></li>
    <li><b>upper()</b></li>
    <li><b>islower()</b></li>
    <li><b>isupper()</b></li>
    <li><b>isnumeric()</b></li>
    <li><b>strip()</b></li>
    <li><b>split()</b></li>
    <li><b>len()</b></li>
    <li><b>get_dummies()</b></li>
    <li><b>startswith()</b></li>
    <li><b>endswith()</b></li>
    <li><b>replace()</b></li>
    <li><b>contains()</b></li>
</ol>


#### 1. lower(): 
<li>It converts all uppercase characters in the strings of a series to lowercase and returns the lowercase strings.</li>


#### 2. upper():
<li>It converts all lowercase characters in the strings of a series to uppercase and returns the uppercase strings.</li>
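
A quick sketch of lower() and upper() through the .str accessor, using made-up names:

```python
import pandas as pd

# Hypothetical names with mixed casing
names = pd.Series(["Alice", "BOB", "carol"])

lowered = names.str.lower()  # alice, bob, carol
uppered = names.str.upper()  # ALICE, BOB, CAROL
```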


#### 3. islower(): 
<li>It checks whether all cased characters in each string are lowercase, and returns a Boolean value for each string.</li>


#### 4. isupper(): 
<li>It checks whether all cased characters in each string are uppercase, and returns a Boolean value for each string.</li>


#### 5. isnumeric():
<li>It checks whether all characters in each string are numeric, and returns a Boolean value for each string.</li>
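
The three checks can be compared side by side on a small made-up series:

```python
import pandas as pd

# Hypothetical values covering the lowercase, uppercase, mixed and digit cases
values = pd.Series(["abc", "ABC", "Abc", "123"])

all_lower = values.str.islower()      # True, False, False, False
all_upper = values.str.isupper()      # False, True, False, False
all_numeric = values.str.isnumeric()  # False, False, False, True
```

A string with no letters at all, such as "123", is reported as neither lowercase nor uppercase.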


#### 6. strip():
<li>If there are spaces at the beginning or end of a string, we should trim them using the strip() method.</li>
<li>It removes leading and trailing whitespace from each string in a DataFrame column.</li>
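
A minimal sketch with made-up city names:

```python
import pandas as pd

# Hypothetical values padded with spaces on either side
cities = pd.Series(["  Boston", "Chicago  ", "  New York  "])

# Only the spaces at the ends are removed; the space inside "New York" stays
cleaned = cities.str.strip()
```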


#### 7. split(' '):
<li>It splits each string around the given separator or pattern.</li>
<li>The elements produced by the split are stored in a list.</li>
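
A short sketch of split() on two names from the appointment data. The `expand=True` variant is an extra shown for comparison and is not required by the text above:

```python
import pandas as pd

names = pd.Series(["Joshua T. Blanton", "Jack T. Gutting"])

# Each string becomes a list of the pieces between spaces
parts = names.str.split(" ")

# expand=True spreads the pieces across the columns of a new dataframe
columns = names.str.split(" ", expand=True)
```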


#### 8. len():
<li>With the help of len() we can compute the length of each string in a DataFrame column.</li>
<li>If a value is missing (NaN), it returns NaN for that value.</li>
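
A small sketch that contrasts an empty string with a missing value, using made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a normal string, an empty string and a missing value
words = pd.Series(["pandas", "", np.nan])

# An empty string has length 0; a missing value stays missing
lengths = words.str.len()
```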


#### 9. get_dummies(): 
<li>It returns a DataFrame of one-hot encoded values: each distinct value becomes a column that holds 1 in the rows where that value occurs and 0 where it does not.</li>
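
A minimal sketch on a made-up series of weather events:

```python
import pandas as pd

# Hypothetical categorical values
events = pd.Series(["Rain", "Sunny", "Rain"])

# One indicator column per distinct value; a 1 marks the rows where it occurs
dummies = events.str.get_dummies()
```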


#### 10. startswith(pattern):
<li>It returns True for each string in the series or Index that starts with the pattern.</li>
<li>If you want to keep only the rows that start with 'ind', you can specify df[df[col].str.startswith('ind')]</li>


#### 11. endswith(pattern):
<li>It returns True for each string in the series or Index that ends with the pattern.</li>
<li>If you want to keep only the rows that end with 'es', you can specify df[df[col].str.endswith('es')]</li>
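
Both filters can be sketched on a small dataframe. The `country` column and its values are made up:

```python
import pandas as pd

# Hypothetical column of lowercase country names
df = pd.DataFrame({"country": ["india", "indonesia", "wales", "peru"]})

# The boolean series returned by each method is used as a row filter
starts_with_ind = df[df["country"].str.startswith("ind")]  # india, indonesia
ends_with_es = df[df["country"].str.endswith("es")]        # wales
```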


#### 12. replace(a,b):
<li>It replaces the value a with the value b.</li>
<li>If you want to remove space characters, you can use the replace() method as:</li>
<code>
df[col_name].str.replace(" ", "")
</code>
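
A runnable version of the same idea on made-up city names. Passing `regex=False` is a choice made here so that the first argument is always treated as literal text:

```python
import pandas as pd

# Hypothetical values containing a space
names = pd.Series(["New York", "Los Angeles"])

# regex=False treats " " as a literal string rather than a regular expression
no_spaces = names.str.replace(" ", "", regex=False)
```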


#### 13. contains():
<li>The contains() method checks whether each string contains a particular substring or pattern.</li>
<li>It is called in a similar way to replace(), but instead of changing the string it only reports whether a match was found.</li>
<li>If the substring is present in a string, it returns the boolean value True, otherwise False.</li>
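
A short sketch using a few car names in the style of the dataset shown earlier. By default contains() interprets the pattern as a regular expression, so `regex=False` is passed here to match plain text:

```python
import pandas as pd

# Hypothetical subset of car names
df = pd.DataFrame({"name": ["Hyundai i20 Magna", "Maruti 800 AC", "Hyundai Creta"]})

# True where the name contains the literal text "Hyundai"
has_hyundai = df["name"].str.contains("Hyundai", regex=False)

# The mask can be used to keep only the matching rows
hyundai_rows = df[has_hyundai]
```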



#### Handling Missing Values
<li>We can use the fillna() method in pandas to fill missing values in different ways.</li>
<li>We can use the interpolate() method to estimate missing values.</li>
<li>We can use dropna() method to drop rows with missing values.</li>
<li>We can also fill missing values with the mean value, median value or the mode value depending on the values of columns.</li>
<li>Filling missing values with mean and median is appropriate when the column has continuous values.</li>
<li>If the data is categorical then filling missing values with mode is a good idea.</li>
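
A minimal sketch of filling with the mean and the mode. The dataframe is made up, and which statistic suits a real column depends on its distribution:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one continuous column and one categorical column
df = pd.DataFrame(
    {"temperature": [32.0, np.nan, 28.0], "event": ["Rain", None, "Rain"]}
)

# Continuous column: fill gaps with the mean of the observed values (30.0)
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())

# Categorical column: fill gaps with the most frequent value ("Rain")
df["event"] = df["event"].fillna(df["event"].mode()[0])
```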

#### fillna(method = 'ffill')
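
Forward fill copies the last valid value into the gaps that follow it. The sketch below uses made-up values and calls Series.ffill(), because the method argument of fillna() is deprecated in recent pandas releases:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two consecutive gaps
temps = pd.Series([32.0, np.nan, np.nan, 28.0])

# Each gap takes the last valid value that came before it
forward_filled = temps.ffill()
```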

#### fillna(method = 'bfill')
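
Backward fill works in the opposite direction and copies the next valid value into the gaps before it. A sketch with made-up values, using Series.bfill() for the same deprecation reason:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a gap at the start and one in the middle
temps = pd.Series([np.nan, 32.0, np.nan, 28.0])

# Each gap takes the next valid value that comes after it
backward_filled = temps.bfill()
```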

#### Interpolate(Linear Interpolation)
<li>method = 'time' interpolates according to the spacing of a datetime index.</li>
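
The difference between linear and time-based interpolation shows up when the index is unevenly spaced. A sketch with made-up readings and dates:

```python
import numpy as np
import pandas as pd

# Hypothetical readings: the gap sits one day after the first reading
# and three days before the last one
days = pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-05"])
temps = pd.Series([10.0, np.nan, 50.0], index=days)

# Linear (the default) ignores the index and treats the points as equally spaced
linear = temps.interpolate()                # middle value becomes 30.0

# Time weights the estimate by the distance between timestamps
by_time = temps.interpolate(method="time")  # middle value becomes 20.0
```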

#### dropna()
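<li>dropna() accepts the how and thresh parameters: how controls whether a row is dropped when any or all of its values are missing, and thresh sets the minimum number of non-missing values a row needs to be kept.</li>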
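
A minimal sketch of the three variants on a made-up dataframe. The how and thresh parameters are used in separate calls, since pandas does not accept both at once:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 has one gap, row 2 is empty
df = pd.DataFrame(
    {
        "temperature": [32.0, np.nan, np.nan],
        "windspeed": [6.0, 9.0, np.nan],
        "event": ["Rain", "Sunny", np.nan],
    }
)

any_missing = df.dropna()            # default how="any": keeps only row 0
all_missing = df.dropna(how="all")   # drops only row 2, where every value is missing
at_least_two = df.dropna(thresh=2)   # keeps rows with 2 or more non-missing values
```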