## EDA
Exploratory data analysis (EDA) is the step in which analysts inspect a dataset to understand its structure, quality, and main patterns before any modeling. Pandas suits this work well because it offers concise functions for inspecting, cleaning, and summarizing tabular data.

To start an EDA process in Python with Pandas, load the data into a DataFrame and inspect it with methods such as `head()`, `tail()`, `info()`, and `describe()`, along with the `shape` attribute. Once the data has been inspected, clean and transform it with methods such as `fillna()`, `dropna()`, `apply()`, and `groupby()`.
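The inspect-then-clean workflow described above can be sketched as follows. This is a minimal illustration, not the notebook's own data: the column names `age` and `city`, the values, and the choice to fill with the median are all assumptions, and the conventional `pd` alias is used.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values (not from the notebook)
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Mumbai", None, "Pune"],
})

# Inspect the structure first
print(df.shape)        # (number of rows, number of columns)
print(df.head(2))      # first two rows
df.info()              # prints dtypes and non-null counts
print(df.describe())   # summary statistics for numeric columns

# Clean: fill missing ages with the median, drop rows with no city
cleaned = df.copy()
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
cleaned = cleaned.dropna(subset=["city"])

# Summarize: mean age per city
mean_age_by_city = cleaned.groupby("city")["age"].mean()
print(mean_age_by_city)
```

Working on `df.copy()` keeps the raw data available for comparison after cleaning.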

EDA also involves data visualization. Pandas supports it through Matplotlib-backed methods, including `plot()`, `hist()`, `plot.scatter()`, and `boxplot()`. These methods produce charts that help analysts identify patterns and relationships in the data.

Together, these tools let analysts gain insight into their data before building predictive models or making strategic business decisions.

Three types of EDA:
1. Univariate analysis
    - Univariate EDA involves analyzing one variable at a time, examining its distribution, central tendency, and variability using statistical measures and visualizations. This approach helps analysts understand the behavior of individual variables in the dataset.
2. Bivariate analysis
    - Bivariate EDA involves analyzing the relationship between two variables using statistical measures and visualizations such as scatter plots, heatmaps, and correlation matrices. This approach helps analysts identify patterns and relationships between two variables.
3. Multivariate analysis
    - Multivariate EDA involves analyzing multiple variables simultaneously, examining their relationship and interactions using techniques such as principal component analysis (PCA), factor analysis, and clustering. This approach helps analysts identify complex patterns and relationships between multiple variables.
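A rough sketch of the three levels in Pandas, assuming a made-up dataset (the columns `height_cm`, `weight_kg`, and `age` are hypothetical). The multivariate step uses a pairwise correlation matrix as a simple stand-in; PCA, factor analysis, and clustering need additional libraries and are not shown.

```python
import pandas as pd

# Hypothetical dataset used only for illustration
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 69, 80, 92],
    "age": [30, 25, 41, 35, 28],
})

# Univariate: distribution of a single variable
height_summary = df["height_cm"].describe()
print(height_summary)

# Bivariate: Pearson correlation between two variables
height_weight_corr = df["height_cm"].corr(df["weight_kg"])
print(height_weight_corr)

# Multivariate (simplified): pairwise correlations across all columns
corr_matrix = df.corr()
print(corr_matrix)
```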

In [51]:
import pandas as prathamesh_pd

In [52]:
prathamesh_dataframe = prathamesh_pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [444, 555, 666, 444],
    'col3': ['pratham', 'shubham', 'saurabh', 'saurabh']
})
prathamesh_dataframe.head()

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 1 | 444 | pratham |
| 1 | 2 | 555 | shubham |
| 2 | 3 | 666 | saurabh |
| 3 | 4 | 444 | saurabh |


In [53]:
prathamesh_dataframe.head(2)

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 1 | 444 | pratham |
| 1 | 2 | 555 | shubham |


### info()
The `df.info()` method prints a concise summary of the DataFrame: the index range, the number of rows and columns, each column's data type, and each column's non-null count. It gives a quick overview of the dataset's structure early in exploratory analysis.

The output of `df.info()` includes:
- The column name
- The number of non-null values in the column
- The data type of the column
- A count of columns per data type
- The total memory usage of the DataFrame (not per column)

`info()` does not report the number of unique values; use `nunique()` for that. Null values are not listed directly, but a non-null count below the total number of entries reveals them.

By inspecting this output, analysts can quickly identify columns with null values or unexpected data types and plan their cleaning and transformation steps.
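`info()` prints its summary rather than returning it, so per-column null counts, distinct-value counts, and memory figures are usually obtained with separate calls. A small sketch, assuming a hypothetical DataFrame with missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "col1": [1, 2, np.nan, 4],
    "col2": ["a", "b", "b", None],
})

df.info()  # prints the summary and returns None

# Complementary per-column checks
null_counts = df.isna().sum()                    # missing values per column
unique_counts = df.nunique()                     # distinct non-null values per column
per_column_memory = df.memory_usage(deep=True)   # bytes per column, plus the index

print(null_counts)
print(unique_counts)
print(per_column_memory)
```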

In [54]:
prathamesh_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   col1    4 non-null      int64 
 1   col2    4 non-null      int64 
 2   col3    4 non-null      object
dtypes: int64(2), object(1)
memory usage: 228.0+ bytes


- `unique()` function returns an array of unique values in a Pandas Series or a DataFrame column.
- `nunique()` function returns the count of unique values in a Pandas Series or a DataFrame column.
- `value_counts()` function returns a Series containing the counts of unique values in a Pandas Series or a DataFrame column, sorted in descending order.
- `head()` function returns the first n rows of a DataFrame, where n is a parameter (default is 5).
- `tail()` function returns the last n rows of a DataFrame, where n is a parameter (default is 5).
- `apply()` function applies a function along a specific axis of a DataFrame, row-wise or column-wise, depending on the axis parameter.
- `mean()` function calculates the mean of the values in a Pandas Series or a DataFrame column.
- `median()` function calculates the median of the values in a Pandas Series or a DataFrame column.
- `mode()` function returns the most frequent value(s) in a Pandas Series or a DataFrame column. The result is a Series because several values can tie for the mode.
- `sum()` function calculates the sum of the values in a Pandas Series or a DataFrame column.
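Two points in the list above are easier to see in code: how the `axis` parameter changes what `apply()` receives on a DataFrame, and why `mode()` returns a Series. The sketch below uses a small hypothetical DataFrame with columns `a` and `b`.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
col_ranges = df.apply(lambda col: col.max() - col.min())
print(col_ranges)

# axis=1: the function receives each row as a Series
row_totals = df.apply(lambda row: row["a"] + row["b"], axis=1)
print(row_totals)

# mode() returns every value tied for most frequent
modes = pd.Series([1, 1, 2, 2, 3]).mode()
print(modes)
```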

In [55]:
prathamesh_dataframe['col2'].unique()

array([444, 555, 666], dtype=int64)

In [56]:
prathamesh_dataframe['col3'].nunique()

3

In [57]:
prathamesh_dataframe['col2'].value_counts()

col2
444    2
555    1
666    1
Name: count, dtype: int64

In [58]:
prathamesh_dataframe.tail(2)

| | col1 | col2 | col3 |
|---|---|---|---|
| 2 | 3 | 666 | saurabh |
| 3 | 4 | 444 | saurabh |


In [59]:
prathamesh_dataframe['col3'].apply(len)

0    7
1    7
2    7
3    7
Name: col3, dtype: int64

In [60]:
prathamesh_dataframe['col2'].mean()

527.25

In [61]:
prathamesh_dataframe['col2'].sum()

2109

In [62]:
prathamesh_dataframe['col2'].median()

499.5

In [63]:
prathamesh_dataframe['col2'].mode()

0    444
Name: col2, dtype: int64

In [64]:
del prathamesh_dataframe['col1']

In [65]:
prathamesh_dataframe

| | col2 | col3 |
|---|---|---|
| 0 | 444 | pratham |
| 1 | 555 | shubham |
| 2 | 666 | saurabh |
| 3 | 444 | saurabh |


The `columns` and `index` attributes

In [66]:
prathamesh_dataframe.columns

Index(['col2', 'col3'], dtype='object')

In [67]:
prathamesh_dataframe.index

RangeIndex(start=0, stop=4, step=1)

Sorting the dataset by a column

In [68]:
prathamesh_dataframe.sort_values(by='col2')

| | col2 | col3 |
|---|---|---|
| 0 | 444 | pratham |
| 3 | 444 | saurabh |
| 1 | 555 | shubham |
| 2 | 666 | saurabh |


In [69]:
prathamesh_dataframe['col2'][2] = 420

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  prathamesh_dataframe['col2'][2] = 420


The `SettingWithCopyWarning` appears when Pandas cannot tell whether an assignment targets the original DataFrame or a temporary copy. The cell above triggers it through chained indexing: `prathamesh_dataframe['col2']` first selects a Series, and `[2] = 420` then assigns into that intermediate object. The same warning commonly occurs when we create a new DataFrame by indexing or slicing an existing one and then modify the result.

The warning signals that the change may or may not reach the original DataFrame, depending on whether the intermediate object is a view or a copy. Here the assignment did take effect, but that behavior is not guaranteed, which can lead to unexpected results in an analysis.

To avoid the warning, modify the original DataFrame with a single `.loc` or `.iloc` call, such as `prathamesh_dataframe.loc[2, 'col2'] = 420`. When an independent subset is wanted, create it explicitly with the `copy()` method; changes then apply only to the copy and leave the original untouched.
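A minimal sketch of both approaches, rebuilding the same two-column data under the shorter name `df` (an assumption made so the example stands alone):

```python
import pandas as pd

df = pd.DataFrame({
    "col2": [444, 555, 666, 444],
    "col3": ["pratham", "shubham", "saurabh", "saurabh"],
})

# One .loc call selects the row label and the column together,
# so the assignment writes directly into df
df.loc[2, "col2"] = 420

# An explicit copy is independent of df: edits stay in the copy
subset = df[df["col2"] > 430].copy()
subset["col2"] = 0

print(df)
print(subset)
```

The `.loc` form modifies the original in place, while the `copy()` form is for cases where the original should stay unchanged.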

In [70]:
prathamesh_dataframe

| | col2 | col3 |
|---|---|---|
| 0 | 444 | pratham |
| 1 | 555 | shubham |
| 2 | 420 | saurabh |
| 3 | 444 | saurabh |


### sort_values()

The `sort_values()` function is a powerful method in Pandas for sorting the data in a DataFrame by one or more columns. This function can be used to sort the data in ascending or descending order based on the values in one or more columns.

To use the `sort_values()` function, we pass the column name or a list of column names to the `by` parameter. We can also specify the sorting order using the `ascending` parameter, which is set to True by default. 

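To sort a DataFrame by multiple columns, we pass a list of column names to the `by` parameter. The DataFrame is sorted by the first column in the list, with ties broken by the second column, and so on. The `ascending` parameter also accepts a list of booleans, one per column. By default `sort_values()` returns a new DataFrame and leaves the original unchanged.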

The `sort_values()` function is useful in exploratory data analysis as it allows analysts to quickly sort and view the data in different orders to gain insights and identify patterns. It is also useful in data preprocessing, where we may need to sort the data based on specific columns before performing data transformations or analysis.
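The `ascending` parameter can be sketched as follows, using a standalone copy of the two-column data under the name `df` (an assumption made so the example runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    "col2": [444, 555, 420, 444],
    "col3": ["pratham", "shubham", "saurabh", "saurabh"],
})

# Descending order on a single column
descending = df.sort_values(by="col2", ascending=False)
print(descending)

# col2 ascending, ties broken by col3 descending
mixed = df.sort_values(by=["col2", "col3"], ascending=[True, False])
print(mixed)

# sort_values() returns a new DataFrame; df keeps its original order
print(df)
```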

In [71]:
prathamesh_dataframe.sort_values(by=['col2','col3'])

| | col2 | col3 |
|---|---|---|
| 2 | 420 | saurabh |
| 0 | 444 | pratham |
| 3 | 444 | saurabh |
| 1 | 555 | shubham |


In [72]:
prathamesh_dataframe.describe()

| | col2 |
|---|---|
| count | 4.0 |
| mean | 465.75 |
| std | 60.56608 |
| min | 420.0 |
| 25% | 438.0 |
| 50% | 444.0 |
| 75% | 471.75 |
| max | 555.0 |


The `transpose()` method (also available as the `.T` attribute) returns a new DataFrame with the rows and columns swapped. Applied to the output of `describe()`, it places each column's summary statistics on a single row, which is easier to read when the DataFrame has many numeric columns.

In [73]:
prathamesh_dataframe.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| col2 | 4.0 | 465.75 | 60.56608 | 420.0 | 438.0 | 444.0 | 471.75 | 555.0 |


In [74]:
prathamesh_dataframe.min()

col2        420
col3    pratham
dtype: object

In [75]:
prathamesh_dataframe.max()

col2        555
col3    shubham
dtype: object

In [76]:
prathamesh_dataframe.sum()

col2                            1863
col3    prathamshubhamsaurabhsaurabh
dtype: object