# Introduction to Pandas for data manipulation
### Pandas is an open-source Python library used for data manipulation and analysis.

### What is Pandas?
* Pandas is a Python library used for working with data sets.
* It has functions for analyzing, cleaning, exploring, and manipulating data.
* The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

## Why Use Pandas?
* Pandas lets us analyze large data sets and draw conclusions based on statistical theory.
* Pandas can clean messy data sets and make them readable and relevant.
* Relevant data is very important in data science.
* API reference: https://pandas.pydata.org/pandas-docs/stable/reference/

## A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [2]:
import pandas as pd  # install with: pip install pandas

In [3]:
import numpy as np
# In IPython/Jupyter, a trailing ? shows additional information about an object
np?

### Create a DataFrame from list

In [4]:
data=[['Sia',21],['Nick',20],['James',19]]
type(data)

list

In [4]:
print(data)

[['Sia', 21], ['Nick', 20], ['James', 19]]


In [5]:
df_list=pd.DataFrame(data,columns=['Name','Age'])
print(df_list)

    Name  Age
0    Sia   21
1   Nick   20
2  James   19


In [6]:
type(df_list)

pandas.core.frame.DataFrame

### Create a DataFrame from NumPy array

In [7]:
data1=np.array([['Sia',21],['Nick',20],['James',19]])
type(data1)

numpy.ndarray

In [8]:
df_array=pd.DataFrame(data1,columns=['Name','Age'])
df_array

    Name  Age
0    Sia   21
1   Nick   20
2  James   19
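
One caveat with the array route: a NumPy array holds a single dtype, so mixing names and ages stores the ages as strings. The sketch below shows one way to check for this and convert the column, reusing the same sample data. The names `arr` and `df_from_array` are only illustrative.

```python
import numpy as np
import pandas as pd

# Mixing strings and integers makes NumPy store every element as a string.
arr = np.array([['Sia', 21], ['Nick', 20], ['James', 19]])
df_from_array = pd.DataFrame(arr, columns=['Name', 'Age'])

# The Age column is not numeric yet.
age_is_numeric_before = pd.api.types.is_numeric_dtype(df_from_array['Age'])

# Convert the column explicitly to integers.
df_from_array['Age'] = df_from_array['Age'].astype(int)
age_is_numeric_after = pd.api.types.is_numeric_dtype(df_from_array['Age'])

print(age_is_numeric_before, age_is_numeric_after)
print(df_from_array['Age'].mean())
```

Building the DataFrame from a list or dictionary, as in the neighbouring cells, avoids the conversion because each column keeps its own type.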


### Create a DataFrame from dictionary

In [1]:
data2={'Name': ['Sia','Nick','James'], 'Age': [21,20,19]}
data2

{'Name': ['Sia', 'Nick', 'James'], 'Age': [21, 20, 19]}

In [10]:
type(data2)

dict

In [11]:
df_dict=pd.DataFrame(data2)
df_dict

    Name  Age
0    Sia   21
1   Nick   20
2  James   19


## Read a CSV file
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [12]:
# police.csv must be in the current working directory;
# otherwise pandas raises FileNotFoundError.
df=pd.read_csv("police.csv")
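
If the file is not at hand, `read_csv()` can be tried on in-memory text instead. This is a minimal sketch: the rows below are hypothetical sample data whose column names mirror the ones used later in this notebook, not the contents of the real `police.csv`.

```python
import io
import pandas as pd

# Hypothetical sample rows; the real police.csv has more columns and rows.
csv_text = """stop_date,stop_time,driver_gender,violation
2005-01-02,01:55,M,Speeding
2005-01-18,08:15,M,Speeding
2005-01-23,23:15,F,Equipment
"""

# read_csv accepts any file-like object, not only a path.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)
print(sample_df.columns.tolist())
```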

### `head()` displays the first 5 rows of our data set.
### `tail()` displays the last 5 rows.

In [None]:
df.head()

In [None]:
#first 10 rows
df.head(10)

In [None]:
df.tail(5)

### `sample()` lets us choose random rows from our data frame. We can pass it the number of rows we want to fetch as a parameter.

In [None]:
df.sample(5)

### `shape` shows the dimensions of our dataset

In [None]:
df.shape
#returns (rows, columns)

### More information about the dataset
- `info()`: summarize the columns, their data types, and their non-null counts
- `columns`: get the names of all the features/columns in our data frame

In [None]:
df.info()

In [None]:
df.columns

### Dates and times with `to_datetime()`
Pandas `to_datetime()` can parse most common date strings to datetime without any additional arguments.
* Convert strings to datetime
* Assemble a datetime from multiple columns
* Get year, month and day
* Get the week of year, the day of week, and leap year
* Get the age from the date of birth
* Improve performance by setting date column as the index
* Select data with a specific year and perform aggregation
* Select data with a specific month and a specific day of the month
* Select data between two dates
* Handle missing values
### Try all of these (`pd.datetime` was removed in pandas 2.0, so use `pd.Timestamp`):
1. print(pd.Timestamp.now())
2. print(pd.Timestamp.now().date())
3. print(pd.Timestamp.now().year)
4. print(pd.Timestamp.now().month)
5. print(pd.Timestamp.now().day)
6. print(pd.Timestamp.now().hour)
7. print(pd.Timestamp.now().minute)
8. print(pd.Timestamp.now().second)
9. print(pd.Timestamp.now().microsecond)
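
Several of the datetime tasks listed in this section can be sketched on a small frame. The data and the names `events`, `parts`, and `in_2005` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data, not taken from police.csv.
events = pd.DataFrame({
    'when': ['2005-01-02', '2005-03-14', '2006-07-30'],
    'value': [1, 2, 3],
})

# Convert strings to datetime.
events['when'] = pd.to_datetime(events['when'])

# Assemble a datetime from separate year, month, and day columns.
parts = pd.DataFrame({'year': [2005, 2006], 'month': [1, 7], 'day': [2, 30]})
assembled = pd.to_datetime(parts)

# Extract components through the .dt accessor.
events['year'] = events['when'].dt.year
events['day_of_week'] = events['when'].dt.dayofweek  # Monday is 0
events['is_leap_year'] = events['when'].dt.is_leap_year

# Select the rows between two dates (both ends included).
start, end = pd.Timestamp('2005-01-01'), pd.Timestamp('2005-12-31')
in_2005 = events[events['when'].between(start, end)]
print(in_2005)
```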

In [None]:
from datetime import datetime
datetime.now()

In [None]:
datetime.now().minute

In [None]:
datetime.now().hour

In [None]:
datetime.now().year

In [None]:
df['stop_date']=pd.to_datetime(df['stop_date'])

In [None]:
df['stop_time']=pd.to_datetime(df['stop_time'])

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df['year'] = df['stop_date'].dt.year

In [None]:
df['month']=df['stop_date'].dt.month
df['day']=df['stop_date'].dt.day

In [None]:
df.head()

In [None]:
df['hour']=df['stop_time'].dt.hour
df.sample(3)

In [None]:
# Age as of 2023, assuming driver_age_raw holds the year of birth
df['driver_age'] = 2023 - df['driver_age_raw']
df.sample(3)

### `drop()` removes the specified rows or columns
- The `axis` argument specifies whether to drop rows (0) or columns (1).
- The `inplace` argument, when `True`, drops the columns in place without reassigning the DataFrame.
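
The two arguments can be sketched on a small frame. The frame `people` and its `City` column are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration.
people = pd.DataFrame({
    'Name': ['Sia', 'Nick', 'James'],
    'Age': [21, 20, 19],
    'City': ['A', 'B', 'C'],
})

# axis=1 drops columns; the original frame is left untouched.
without_city = people.drop(['City'], axis=1)

# axis=0 drops rows by index label.
without_first_row = people.drop([0], axis=0)

# inplace=True modifies the frame itself and returns None.
result_of_inplace = people.drop(['Age'], axis=1, inplace=True)

print(without_city.columns.tolist())
print(people.columns.tolist())
```

Without `inplace=True`, `drop()` returns a new DataFrame, so the result has to be assigned to a variable to be kept.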

In [None]:
df.columns

In [None]:
df.drop(['county_name','driver_age_raw','violation_raw','month','day','hour'],axis=1,inplace=True) #axis=1 drops columns, axis=0 drops rows
df.sample(3)

### `nunique()` returns the number of unique values in a Series or DataFrame

In [None]:
df['violation'].nunique()

In [None]:
df['driver_race'].nunique()

### `value_counts()` identifies the different categories in a feature and counts the values in each category

In [None]:
df['violation'].value_counts()

In [None]:
df['search_conducted'].value_counts()

In [None]:
#driver_race
df['driver_race'].value_counts()

In [None]:
#stop_outcome 
df['stop_outcome'].value_counts()
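
A small sketch of how `nunique()` and `value_counts()` treat missing values, using a hypothetical Series (`violations` is an illustrative name and the values are made up):

```python
import pandas as pd

# Hypothetical sample data with one missing value.
violations = pd.Series(['Speeding', 'Speeding', 'Equipment', 'Speeding', None])

# Missing values are excluded from the unique count by default.
n_unique = violations.nunique()

# Counts per category, most frequent first.
counts = violations.value_counts()

# normalize=True returns proportions instead of counts.
shares = violations.value_counts(normalize=True)

# dropna=False adds a separate entry for missing values.
counts_with_missing = violations.value_counts(dropna=False)

print(n_unique)
print(counts)
```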

### `describe()` returns summary statistics for the numerical columns

In [None]:
df.describe()

### `isnull().sum()` returns the number of null values in each column of a DataFrame

In [None]:
df.isnull().sum()

### `fillna()` fills in missing values
Using `isnull()` and `sum()`, we can check whether our data has any missing values. We then fill the missing values with `fillna()`, here using the mode (the value that appears most frequently in a data set) of the particular feature. Common fill strategies:
- mean (numerical)
- mode/most frequent (numerical/categorical)
- median (numerical)
- constant (numerical/categorical)
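
The four fill strategies listed above can be sketched side by side. The frame `drivers` and its values are hypothetical sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column.
drivers = pd.DataFrame({
    'age': [20.0, np.nan, 40.0, 30.0],
    'gender': ['M', None, 'F', 'M'],
})

# Numerical column: fill with the mean or the median.
age_mean_filled = drivers['age'].fillna(drivers['age'].mean())
age_median_filled = drivers['age'].fillna(drivers['age'].median())

# Categorical column: fill with the mode (most frequent value) or a constant.
gender_mode_filled = drivers['gender'].fillna(drivers['gender'].mode()[0])
gender_const_filled = drivers['gender'].fillna('Unknown')

print(age_mean_filled.tolist())
print(gender_mode_filled.tolist())
```

`mode()` returns a Series because several values can tie for most frequent, which is why the first entry is selected with `[0]`.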


#### inplace=True
* Pandas creates a copy of the original data.
* Performs the required operation on it.
* Assigns the result back to the original data (an important point to consider here).
* Then deletes the copy.

In [None]:
df.columns

In [None]:
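gender=df['driver_gender'].mode()[0]
print(gender)
# Assign the result back: calling fillna(..., inplace=True) on a selected column
# triggers a chained-assignment warning in recent pandas versions
df['driver_gender']=df['driver_gender'].fillna(gender)
df['driver_gender'].isnull().sum()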

In [None]:
#driver_race
race=df['driver_race'].mode()[0]
print(race)
df['driver_race']=df['driver_race'].fillna(race)
df['driver_race'].isnull().sum()

In [None]:
#stop_outcome
stop_outcome=df['stop_outcome'].mode()[0]
print(stop_outcome)
# Here a constant is used as the fill value instead of the mode
df['stop_outcome']=df['stop_outcome'].fillna('Unknown')
df['stop_outcome'].isnull().sum()

In [None]:
#is_arrested
is_arrested=df['is_arrested'].mode()[0]
print(is_arrested)
df['is_arrested']=df['is_arrested'].fillna(is_arrested)
df['is_arrested'].isnull().sum()

In [None]:
#violation
violation=df['violation'].mode()[0]
print(violation)
df['violation']=df['violation'].fillna(violation)
df['violation'].isnull().sum()

### `plot()` makes plots of a Series or DataFrame

`plot()` uses the backend specified by the option `plotting.backend`. By default, matplotlib is used. The `kind` of plot to produce:

- `'line'` : line plot (default)
- `'bar'` : vertical bar plot
- `'barh'` : horizontal bar plot
- `'hist'` : histogram
- `'box'` : boxplot
- `'kde'` : Kernel Density Estimation plot
- `'density'` : same as `'kde'`
- `'area'` : area plot
- `'pie'` : pie plot
- `'scatter'` : scatter plot (DataFrame only)
- `'hexbin'` : hexbin plot (DataFrame only)

In [None]:
df.columns

In [None]:
#bar plot for violation
df['violation'].value_counts().plot(kind='bar')

In [None]:
#pie plot for violation
df['violation'].value_counts().plot(kind='pie')

In [None]:
#histogram for age
df['driver_age'].plot(kind='hist')

In [None]:
#boxplot for age
df['driver_age'].plot(kind='box')

In [None]:
df.describe()

In [None]:
#horizontal bar plot for stop_outcome
df['stop_outcome'].value_counts().plot(kind='barh')

In [None]:
df.plot.scatter(x='year',y='driver_age')

### `nsmallest()` and `nlargest()` return the "n" rows of our dataset with the lowest or highest values, respectively, in a given column

In [None]:
df.nlargest(3, 'driver_age')

In [None]:
df.nsmallest(3, 'driver_age')

### `groupby()` is very useful in data analysis because it lets us uncover the underlying relationships among different variables.
We can then apply aggregations to the groups with the `agg()` function, passing it aggregation operations such as mean, size, sum, or std.
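
A minimal sketch of `groupby()` combined with `agg()`, using hypothetical sample data. The column names mirror the ones in this notebook, but the frame `stops` and its values are made up.

```python
import pandas as pd

# Hypothetical sample data for illustration.
stops = pd.DataFrame({
    'violation': ['Speeding', 'Speeding', 'Equipment', 'Equipment'],
    'driver_age': [20, 30, 40, 50],
    'is_arrested': [False, True, False, False],
})

# Several aggregations of one column, one row per group.
summary = stops.groupby('violation')['driver_age'].agg(['mean', 'min', 'max', 'count'])

# Named aggregation: pick a column and an operation for each output column.
named = stops.groupby('violation').agg(
    mean_age=('driver_age', 'mean'),
    arrests=('is_arrested', 'sum'),
)

print(summary)
print(named)
```

Selecting the column before aggregating keeps non-numeric columns out of operations such as `mean`.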

In [None]:
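# numeric_only=True restricts the median to the numeric columns;
# without it, recent pandas versions raise an error on text columns
df.groupby('driver_race').median(numeric_only=True)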

In [None]:
df.groupby('violation').mean(numeric_only=True)

### `loc` and `iloc`
`loc` and `iloc` are indexers, used with square brackets, for slicing data from a pandas DataFrame. `loc` also accepts a boolean condition, which helps in filtering the data.
* `loc`: select by labels
* `iloc`: select by positions
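
The difference between labels and positions is easiest to see on a frame whose index is not the default 0, 1, 2. The frame `people` and its index labels below are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with string index labels.
people = pd.DataFrame(
    {'Name': ['Sia', 'Nick', 'James'], 'Age': [21, 20, 19]},
    index=['a', 'b', 'c'],
)

# loc slices by label and includes the end label.
by_label = people.loc['a':'b']

# iloc slices by position and excludes the end position.
by_position = people.iloc[0:1]

# A single value at row position 2, column position 1.
single_value = people.iloc[2, 1]

# loc with a boolean condition filters rows.
younger_names = people.loc[people['Age'] < 21, 'Name']

print(len(by_label), len(by_position))
print(younger_names.tolist())
```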

In [None]:
# Select the driver_age and violation columns
df.loc[:,['driver_age','violation']]

In [None]:
# Select driver_age and violation for row labels 3 to 10 (loc includes both ends)
df.loc[3:10,['driver_age','violation']]

In [None]:
# Select driver_age and violation for row labels 3 to 10 in steps of 2
df.loc[3:10:2,['driver_age','violation']]

In [None]:
#print three random entries where the driver age is less than 17 and the violation is 'Speeding'
df.loc[(df.driver_age<17) & (df.violation=='Speeding')].sample(3)

In [None]:
#print top 4 rows where driver_age is 21 and the driver gender is M
df.loc[(df.driver_age==21)&(df.driver_gender=='M')].head(4)

In [None]:
#iloc
# Print the value at row position 5 and column position 6
df.iloc[5,6]

In [None]:
df.iloc[:5,:6]

### Sorting with `sort_index()` and `sort_values()`

In [None]:
# sort values
# extract driver age, driver_race, and violation
df.loc[:,['driver_age','driver_race','violation']].sort_values(by='driver_age')
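
The cell above uses only `sort_values()`. As a small sketch of both methods, on hypothetical sample data with a deliberately shuffled index:

```python
import pandas as pd

# Hypothetical sample data; the index is out of order on purpose.
people = pd.DataFrame(
    {'Name': ['Sia', 'Nick', 'James'], 'Age': [21, 20, 19]},
    index=[2, 0, 1],
)

# Sort rows by a column, highest age first.
by_age_desc = people.sort_values(by='Age', ascending=False)

# Sort rows by their index labels.
by_index = people.sort_index()

print(by_age_desc)
print(by_index)
```

Both methods return a new DataFrame and leave `people` unchanged.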

### `query()` filters our data frame according to our conditions

In [None]:
# print top 10 rows where the drivers are between 31 to 50 years old 
df.query('driver_age>30 & driver_age<=50').head(10)

In [None]:
# print three random entries where the driver is 18 years old and female
df.query('driver_age==18 & driver_gender=="F"').sample(3)

In [None]:
#Find the entries where the driver is at least 45 years old, the violation is "Speeding", and the driver is not arrested
df1=df.query('driver_age>=45 & violation=="Speeding" & is_arrested==False')

In [None]:
df1.head()
df1.shape
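
As a sketch of how `query()` relates to boolean indexing, the example below applies the same filter both ways. The frame `stops` and the variable `min_age` are hypothetical. Inside the query string, `@` refers to a Python variable.

```python
import pandas as pd

# Hypothetical sample data for illustration.
stops = pd.DataFrame({
    'driver_age': [18, 35, 47, 52],
    'violation': ['Speeding', 'Equipment', 'Speeding', 'Speeding'],
    'is_arrested': [False, False, False, True],
})

min_age = 45

# Filter with a query string.
with_query = stops.query(
    'driver_age >= @min_age and violation == "Speeding" and is_arrested == False'
)

# The same filter written as a boolean mask.
with_mask = stops[
    (stops['driver_age'] >= min_age)
    & (stops['violation'] == 'Speeding')
    & (~stops['is_arrested'])
]

print(with_query)
```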

## Save dataframe to CSV format

In [None]:
df1.to_csv('police_new.csv',index=False) #the file name needs the .csv extension

In [None]:
# The file was saved with index=False, so there is no index column to read back
df_new=pd.read_csv('police_new.csv')
df_new.sample(5)

In [None]:
df_new.info()
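
The save-and-reload round trip can be checked without touching the disk. This sketch writes hypothetical sample data to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Hypothetical sample data for illustration.
people = pd.DataFrame({'Name': ['Sia', 'Nick', 'James'], 'Age': [21, 20, 19]})

# Write CSV text to an in-memory buffer; index=False omits the row index.
buffer = io.StringIO()
people.to_csv(buffer, index=False)

# Rewind the buffer and read it back.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Because the index was not written, the reloaded frame gets a fresh default index and the same columns as the original.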

In [13]:
df.groupby('violation').agg('min')