# *Prologue to Data Processing with Pandas :*

Most of the time in data analysis and modeling is spent on data preparation: loading, cleaning, and rearranging the data. Pandas gives Python a high-performance, flexible, high-level environment for this work, and it offers many functions for processing data effectively.


## Loading a CSV file in pandas :

In [20]:
import pandas as pd

df = pd.read_csv("aaraiz.csv")

df

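| | Temprature | Humidity | Wind | Rain |
|---|---|---|---|---|
| 0 | 60.0 | 75 | 4 | |
| 1 | 48.0 | 68 | 6 | NO |
| 2 | 45.0 | 52 | 3 | NO |
| 3 | 46.0 | ...56 | 4 | NO |
| 4 | 60.0 | 75 | 4 | |
| ... | ... | ... | ... | ... |
| 90 | 67.0 | /69 | 7 | N |
| 91 | 44.0 | | 8 | |
| 92 | 54.0 | 62 | 9 | N |
| 93 | 58.0 | 58 | 3 | N |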
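
As a minimal sketch of what `read_csv` does, the snippet below reads a tiny in-memory CSV instead of `aaraiz.csv`. The sample rows are made up for illustration and are not the contents of the original file.

```python
import io
import pandas as pd

# Hypothetical sample standing in for aaraiz.csv (values are illustrative only).
csv_text = (
    "Temprature,Humidity,Wind,Rain\n"
    "60.0,75,4,\n"
    "48.0,68,6,NO\n"
    "45.0,...56,3,NO\n"
)

# read_csv accepts a file path or any file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # Humidity is read as text because of the "...56" entry
```

Empty fields, like the first `Rain` value, are loaded as `NaN`, which is why blank cells show up in the displayed table.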


## Data Cleaning with pandas  :
Data cleaning means fixing bad data in your data set.
Bad data could be:
- Empty cells
- Data in wrong format
- Wrong data
- Duplicates
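
A quick way to check for several of these problems is sketched below. The `sample` frame is hypothetical and only mimics the kinds of issues listed above; "wrong data" usually needs domain knowledge and is not covered here.

```python
import numpy as np
import pandas as pd

# Hypothetical frame containing several of the problems listed above.
sample = pd.DataFrame({
    "Temprature": [60.0, 48.0, np.nan, 60.0],
    "Humidity": ["75", "...68", "52", "75"],   # "...68" is in the wrong format
    "Rain": [None, "NO", "N", None],
})

empty_cells = sample.isna().sum()            # empty cells per column
duplicate_rows = sample.duplicated().sum()   # rows identical to an earlier row
# Values that cannot be parsed as numbers become NaN with errors="coerce".
bad_format = pd.to_numeric(sample["Humidity"], errors="coerce").isna().sum()

print(empty_cells, duplicate_rows, bad_format, sep="\n")
```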

## Removing duplicates by using pandas  :


In [25]:
df.drop_duplicates()

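| | Temprature | Humidity | Wind | Rain |
|---|---|---|---|---|
| 0 | 60.0 | 75 | 4 | |
| 1 | 48.0 | 68 | 6 | NO |
| 2 | 45.0 | 52 | 3 | NO |
| 3 | 46.0 | ...56 | 4 | NO |
| 5 | 48.0 | ...63 | 3 | NO |
| ... | ... | ... | ... | ... |
| 89 | 59.0 | 52 | 2 | NO |
| 90 | 67.0 | /69 | 7 | N |
| 92 | 54.0 | 62 | 9 | N |
| 93 | 58.0 | 58 | 3 | N |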


## Saving dataset after removing duplicates  :

In [26]:
df = df.drop_duplicates()
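
The reassignment matters because `drop_duplicates()` returns a new DataFrame rather than changing the existing one. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical data: the second row repeats the first.
sample = pd.DataFrame({"Wind": [4, 4, 6], "Rain": ["NO", "NO", "N"]})

deduped = sample.drop_duplicates()     # returns a new DataFrame
print(len(sample), len(deduped))       # the original is unchanged

# Surviving rows keep their original index labels, which leaves gaps;
# reset_index(drop=True) renumbers them from 0 if that is preferred.
renumbered = deduped.reset_index(drop=True)
print(deduped.index.tolist(), renumbered.index.tolist())
```

This is also why index labels such as 4 and 91 are missing from the de-duplicated output.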

## Viewing dataset without duplicates  :
After `drop_duplicates()`, the dataset has 73 rows, down from 94, so the duplicates are gone. Next, we remove the stray slashes and dots:

In [27]:
df

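| | Temprature | Humidity | Wind | Rain |
|---|---|---|---|---|
| 0 | 60.0 | 75 | 4 | |
| 1 | 48.0 | 68 | 6 | NO |
| 2 | 45.0 | 52 | 3 | NO |
| 3 | 46.0 | ...56 | 4 | NO |
| 5 | 48.0 | ...63 | 3 | NO |
| ... | ... | ... | ... | ... |
| 89 | 59.0 | 52 | 2 | NO |
| 90 | 67.0 | /69 | 7 | N |
| 92 | 54.0 | 62 | 9 | N |
| 93 | 58.0 | 58 | 3 | N |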


## Removing unwanted symbols and special characters :

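The `strip()` method removes leading and trailing whitespace from a string: leading means at the beginning, trailing means at the end. You can also pass the character(s) to remove; if you pass none, whitespace is removed. Its variant `lstrip()`, used here through the `.str` accessor, removes only leading characters, which is enough to drop the `/` and `...` prefixes in the `Humidity` column.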

In [40]:
# lstrip takes a set of characters: remove any leading "/" or "." in one pass
df["Humidity"].str.lstrip("/.")



0     75
1     68
2     52
3     56
5     63
      ..
89    52
90    69
92    62
93    58
94    61
Name: Humidity, Length: 73, dtype: object
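
The following sketch shows how the character argument behaves, using a hypothetical Series rather than the real column. The argument is treated as a set of characters, not as a prefix string.

```python
import pandas as pd

# Hypothetical values imitating the messy Humidity entries.
humidity = pd.Series(["75", "/69", "...56", " 62 "])

# "/." strips any leading run of slashes and dots;
# lstrip("...") would behave the same as lstrip(".").
cleaned = humidity.str.lstrip("/.")
print(cleaned.tolist())

# strip() with no argument removes whitespace from both ends instead.
trimmed = humidity.str.strip()
print(trimmed.tolist())
```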

## Saving and updating single column dataset :

In [41]:
df["Humidity"] = df["Humidity"].str.lstrip("/")
df["Humidity"] = df["Humidity"].str.lstrip("...")

Each of the two assignments prints the same warning (shown once here):

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["Humidity"] =df["Humidity"].str.lstrip("/")

This is pandas' `SettingWithCopyWarning`. It appears because `df` was produced from another DataFrame by `drop_duplicates()`, so pandas cannot tell whether the change was meant for the original data. The column is still updated, as the next cell shows. Calling `.copy()` on the de-duplicated frame before modifying it is a common way to avoid the warning.
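
One common way to make the intent explicit is to take a copy right after de-duplicating. This is a sketch on made-up data; `raw` and `clean` are illustrative names, not variables from this notebook.

```python
import pandas as pd

# Hypothetical data with one duplicate row and two messy values.
raw = pd.DataFrame({"Humidity": ["75", "/69", "75", "...56"]})

# .copy() makes the de-duplicated frame independent of `raw`,
# so later column assignments are unambiguous.
clean = raw.drop_duplicates().copy()
clean["Humidity"] = clean["Humidity"].str.lstrip("/.")

print(clean["Humidity"].tolist())
print(raw["Humidity"].tolist())   # the source frame is left untouched
```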


## Viewing dataset after removing symbols and characters :

In [39]:
df

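| | Temprature | Humidity | Wind | Rain |
|---|---|---|---|---|
| 0 | 60.0 | 75 | 4 | |
| 1 | 48.0 | 68 | 6 | NO |
| 2 | 45.0 | 52 | 3 | NO |
| 3 | 46.0 | 56 | 4 | NO |
| 5 | 48.0 | 63 | 3 | NO |
| ... | ... | ... | ... | ... |
| 89 | 59.0 | 52 | 2 | NO |
| 90 | 67.0 | 69 | 7 | N |
| 92 | 54.0 | 62 | 9 | N |
| 93 | 58.0 | 58 | 3 | N |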


## Filling the NaN values with blanks  :
`.fillna()` is a method that replaces null (NaN) values with whatever value is passed to it, here an empty string (`''`):

In [58]:
df.fillna('')

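| | Temprature | Humidity | Wind | Rain |
|---|---|---|---|---|
| 0 | 60.0 | 75 | 4 | |
| 1 | 48.0 | 68 | 6 | NO |
| 2 | 45.0 | 52 | 3 | NO |
| 3 | 46.0 | 56 | 4 | NO |
| 5 | 48.0 | 63 | 3 | NO |
| ... | ... | ... | ... | ... |
| 89 | 59.0 | 52 | 2 | NO |
| 90 | 67.0 | 69 | 7 | N |
| 92 | 54.0 | 62 | 9 | N |
| 93 | 58.0 | 58 | 3 | N |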
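
Note that `df.fillna('')` also puts empty strings into numeric columns such as `Temprature`, which typically turns them into `object` columns (consistent with the `df.dtypes` output at the end of this notebook). As an alternative sketch, assuming only the text column should receive blanks, `fillna` accepts a dict that maps column names to fill values. The data below is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column.
sample = pd.DataFrame({
    "Temprature": [60.0, np.nan, 45.0],
    "Rain": [np.nan, "NO", "N"],
})

# Fill only the text column with blanks and leave numeric gaps as NaN,
# so Temprature stays numeric.
filled = sample.fillna({"Rain": ""})

print(filled["Rain"].tolist())
print(filled["Temprature"].dtype)
print(filled["Temprature"].isna().sum())   # the numeric gap remains visible
```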


## Saving dataset with latest updates  :

In [54]:
df = df.fillna('')

## Viewing updated dataset :
As we can see, all null values are now replaced with blanks:

In [55]:
df

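| | Temprature | Humidity | Wind | Rain |
|---|---|---|---|---|
| 0 | 60.0 | 75 | 4 | |
| 1 | 48.0 | 68 | 6 | NO |
| 2 | 45.0 | 52 | 3 | NO |
| 3 | 46.0 | 56 | 4 | NO |
| 5 | 48.0 | 63 | 3 | NO |
| ... | ... | ... | ... | ... |
| 89 | 59.0 | 52 | 2 | NO |
| 90 | 67.0 | 69 | 7 | N |
| 92 | 54.0 | 62 | 9 | N |
| 93 | 58.0 | 58 | 3 | N |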


## Replacing Yes with Y  :
Using different words for the same value in a single column makes the data harder to analyze, so let's make them consistent:

In [61]:
df["Rain"].str.replace('YES' , 'Y')

0       
1     NO
2     NO
3     NO
5     NO
      ..
89    NO
90     N
92     N
93     N
94    NO
Name: Rain, Length: 73, dtype: object

## Replacing No with N  :
The same applies to `NO`, which we replace with `N` so that the column uses one spelling:

In [62]:
df["Rain"].str.replace('NO' , 'N')

0      
1     N
2     N
3     N
5     N
     ..
89    N
90    N
92    N
93    N
94    N
Name: Rain, Length: 73, dtype: object

## Saving the cleaned data with Y & N :

In [63]:
df["Rain"] = df["Rain"].str.replace('YES', 'Y')
df["Rain"] = df["Rain"].str.replace('NO', 'N')
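
`str.replace` works on substrings, so it would also change a value such as `NONE` to `NNE`. As an alternative sketch, `Series.replace` with a dict swaps whole cell values in one step. The Series below is hypothetical.

```python
import pandas as pd

# Hypothetical labels imitating the mixed spellings in the Rain column.
rain = pd.Series(["YES", "NO", "N", "", "Y"])

# Only cells that exactly equal a dict key are replaced.
standardized = rain.replace({"YES": "Y", "NO": "N"})

print(standardized.tolist())
print(standardized.value_counts().to_dict())
```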

## Viewing the cleaned dataset  :

In [64]:
df

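| | Temprature | Humidity | Wind | Rain |
|---|---|---|---|---|
| 0 | 60.0 | 75 | 4 | |
| 1 | 48.0 | 68 | 6 | N |
| 2 | 45.0 | 52 | 3 | N |
| 3 | 46.0 | 56 | 4 | N |
| 5 | 48.0 | 63 | 3 | N |
| ... | ... | ... | ... | ... |
| 89 | 59.0 | 52 | 2 | N |
| 90 | 67.0 | 69 | 7 | N |
| 92 | 54.0 | 62 | 9 | N |
| 93 | 58.0 | 58 | 3 | N |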


# Exploratory Data Analysis of my dataset in pandas  :
Pandas profiling is a Python library that performs automated exploratory data analysis (EDA). It generates a dataset profile report that covers:

- **Type inference:** detect the types of columns in a DataFrame
- **Essentials:** type, unique values, indication of missing values
- **Quantile statistics:** minimum value, Q1, median, Q3, maximum, range, interquartile range
- **Descriptive statistics:** mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- **Most frequent and extreme values**
- **Histograms:** categorical and numerical
- **Correlations:** high correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér’s V, Phik)
- **Missing values:** through counts, matrix and heatmap
- **Duplicate rows:** list of the most common duplicated rows
- **Text analysis:** most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
- **File and Image analysis:** file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata.
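
Several of the statistics in this list can also be computed directly with pandas, without a profiling report. The sketch below uses a small made-up frame and covers only quantiles, skewness, kurtosis, and Pearson correlation.

```python
import pandas as pd

# Hypothetical numeric data, for illustration only.
sample = pd.DataFrame({
    "Temprature": [60.0, 48.0, 45.0, 46.0, 67.0],
    "Wind": [4, 6, 3, 4, 7],
})

quartiles = sample.quantile([0.25, 0.5, 0.75])    # Q1, median, Q3
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]   # interquartile range
skewness = sample.skew()
kurtosis = sample.kurt()
pearson = sample.corr()                           # Pearson by default

print(quartiles, iqr, skewness, kurtosis, pearson, sep="\n\n")
```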

In [65]:
df.describe()

| | Wind |
|---|---|
| count | 73.0 |
| mean | 5.452055 |
| std | 2.466673 |
| min | 1.0 |
| 25% | 4.0 |
| 50% | 5.0 |
| 75% | 7.0 |
| max | 11.0 |


In [66]:
df.shape

(73, 4)
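
By default `describe()` summarizes only numeric columns, which is why `Wind` is the only column in the output above. A sketch of including the other columns, on made-up data:

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
sample = pd.DataFrame({
    "Wind": [4, 6, 3, 4],
    "Rain": ["", "N", "N", "Y"],
})

numeric_summary = sample.describe()             # numeric columns only
full_summary = sample.describe(include="all")   # adds unique/top/freq for text

print(numeric_summary.columns.tolist())
print(full_summary.loc[["count", "unique", "top", "freq"], "Rain"])
```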

## Viewing titles of columns  : 

In [67]:
df.columns

Index(['Temprature', 'Humidity', 'Wind', 'Rain'], dtype='object')

## Counting null values in each column  :

In [73]:
df.isnull().sum()

Temprature    0
Humidity      0
Wind          0
Rain          0
dtype: int64
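
All counts are zero here because the missing values were already replaced with empty strings, and an empty string is not a null value. A minimal sketch of counting blanks explicitly, on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
sample = pd.DataFrame({"Rain": [np.nan, "NO", "N", np.nan]})
filled = sample.fillna("")

nulls_before = sample.isnull().sum()["Rain"]   # NaN cells before filling
nulls_after = filled.isnull().sum()["Rain"]    # empty strings are not null
blanks = (filled == "").sum()["Rain"]          # blanks counted explicitly

print(nulls_before, nulls_after, blanks)
```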

## Counting unique values in each column  :

In [74]:
df.nunique()

Temprature    38
Humidity      41
Wind          11
Rain           4
dtype: int64
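
`nunique()` only reports how many distinct values a column has. To see which values they are and how often each occurs, `unique()` and `value_counts()` can help. The Series below is hypothetical, including its stray lowercase `n`.

```python
import pandas as pd

# Hypothetical labels; the lowercase "n" is an invented inconsistency.
rain = pd.Series(["", "N", "N", "Y", "n", "N"], name="Rain")

distinct_count = rain.nunique()            # how many distinct values
distinct_values = rain.unique().tolist()   # which values, in order of appearance
frequencies = rain.value_counts()          # how often each one occurs

print(distinct_count, distinct_values, frequencies.to_dict())
```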

## Sorting by the Humidity column  :

In [75]:
df.sort_values(by="Humidity").head()

| | Temprature | Humidity | Wind | Rain |
|---|---|---|---|---|
| 58 | 44.0 | | 8 | |
| 27 | 54.0 | 31.0 | 5 | N |
| 25 | 52.0 | 35.0 | 4 | N |
| 86 | 55.0 | 36.0 | 6 | N |
| 12 | | 38.0 | 7 | N |
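
When a column holds text, `sort_values` compares the values as strings, so blanks come first and `"100"` sorts before `"75"`. The sketch below, on made-up data, uses the `key` argument to sort numerically instead; treating blanks as missing is an assumption of this example.

```python
import pandas as pd

# Hypothetical text column with one blank entry.
sample = pd.DataFrame({"Humidity": ["75", "9", "", "100"]})

# Plain sort compares strings character by character.
as_text = sample.sort_values(by="Humidity")

# key= converts the column just for the comparison; blanks become NaN
# and are placed last by default.
as_number = sample.sort_values(
    by="Humidity",
    key=lambda col: pd.to_numeric(col, errors="coerce"),
)

print(as_text["Humidity"].tolist())
print(as_number["Humidity"].tolist())
```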


## Viewing data types of the columns  :

In [78]:
df.dtypes

Temprature    object
Humidity      object
Wind           int64
Rain          object
dtype: object
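
Columns stored as `object` cannot be used directly for numeric work. One possible way to convert them back is sketched below with `pd.to_numeric`. The sample frame is hypothetical, and turning blanks into `NaN` is an assumption about what is wanted.

```python
import pandas as pd

# Hypothetical frame where blanks have forced both columns to object dtype.
sample = pd.DataFrame({
    "Temprature": [60.0, "", 45.0],
    "Humidity": ["75", "68", ""],
})

# errors="coerce" turns anything non-numeric (here the blanks) into NaN.
converted = sample.apply(pd.to_numeric, errors="coerce")

print(sample.dtypes)
print(converted.dtypes)
```

After a conversion like this, `describe()` would include these columns as well.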

# **THANK YOU**