<a href="https://colab.research.google.com/github/M-S-Dhanushkumar/Python_101/blob/main/Pandas_Diabetes_Analysis_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Python Pandas Library**
Pandas is like a superhero for handling data in Python. It helps us work with data in a way that makes it easier to understand, analyze, and manipulate.

Imagine you have a large table of data, just like an Excel spreadsheet. Pandas allows you to bring that data into Python and perform various operations on it. It helps you organize and make sense of the data.

**Import Pandas:** Importing the Python pandas library is like bringing in a specialized toolbox that contains helpful tools for working with data. It allows you to access functions and capabilities specifically designed to handle data efficiently. By importing pandas, you gain the ability to read, manipulate, analyze, and visualize data more easily in your Python code. It's like equipping yourself with the right tools to handle data-related tasks effectively.

In [5]:
# To use the pandas library, we first have to import it
import pandas as pd

**pd.read_csv**: The function pd.read_csv is like a special tool provided by pandas that helps you read data stored in a CSV file. It takes the CSV file's location as input and creates a table-like structure called a DataFrame in Python.

In [51]:
# Read the clinical data from the diabetes.csv file
df = pd.read_csv('/content/diabetes.csv')
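
If you do not have the file at hand, a minimal sketch like the one below shows the same idea on a tiny in-memory CSV. The `csv_text` string and the `sample` name are made up for illustration; the three rows only mimic a few columns of the dataset.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """Glucose,BMI,Age
148,33.6,50
85,26.6,31
183,23.3,32
"""

# read_csv accepts a file path or any file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types inferred by pandas
```

The first line of the CSV becomes the column names, and pandas infers a type for each column.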

**head()**: When you have a large DataFrame with lots of rows and columns, it can be overwhelming to display the entire dataset. Instead, you can use head() to get a concise view of the beginning of the DataFrame.

In [50]:
# Display the first few rows of the DataFrame
print(df.head())

   Pregnancies  Glucose  BP  SkinThickness  Insulin   BMI  \
0            6      148  72             35        0  33.6   
1            1       85  66             29        0  26.6   
2            8      183  64              0        0  23.3   
3            1       89  66             23       94  28.1   
4            0      137  40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  
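
head() shows five rows by default, but you can ask for a different number, and tail() does the same from the end. Here is a small sketch on a made-up ten-row DataFrame (the `sample` name and its `value` column are assumptions for illustration):

```python
import pandas as pd

# Small hypothetical DataFrame with ten rows
sample = pd.DataFrame({"value": range(10)})

first_three = sample.head(3)  # first 3 rows instead of the default 5
last_two = sample.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```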


**Select Columns:** When you have a DataFrame containing multiple columns, you may often need to focus on specific columns that are relevant to your analysis or task. Pandas provides various techniques to select specific columns based on your requirements.
One common method is to use the square bracket notation [ ] with the column names as strings.

In [43]:
# Select the Age and BMI columns
col = df[['Age', 'BMI']]
print(col)

     Age   BMI
0     50  33.6
1     31  26.6
2     32  23.3
3     21  28.1
4     33  43.1
..   ...   ...
763   63  32.9
764   27  36.8
765   30  26.2
766   47  30.1
767   23  30.4

[768 rows x 2 columns]
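
One detail worth knowing: single brackets with one name give you a Series, while a list of names inside the brackets gives you a DataFrame. The sketch below uses a tiny made-up `sample` DataFrame to show the difference.

```python
import pandas as pd

# Hypothetical three-row DataFrame for illustration
sample = pd.DataFrame({
    "Age": [50, 31, 32],
    "BMI": [33.6, 26.6, 23.3],
    "Outcome": [1, 0, 1],
})

age_series = sample["Age"]        # one column name -> Series
age_bmi = sample[["Age", "BMI"]]  # list of column names -> DataFrame

print(type(age_series).__name__)
print(type(age_bmi).__name__)
```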


**Filter Data:** Filtering in pandas helps you extract the data you need for analysis, visualization, or further manipulation. It allows you to focus on specific subsets of data that meet particular criteria, enabling you to derive meaningful insights and draw conclusions from your dataset efficiently. (Syntax: df[df['column_name'] > value])

In [44]:
# Keep only the rows where Age is greater than 30
Filter_age = df[df['Age'] > 30]
print(Filter_age)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
4              0      137             40             35      168  43.1   
8              2      197             70             45      543  30.5   
..           ...      ...            ...            ...      ...   ...   
759            6      190             92              0        0  35.5   
761            9      170             74             31        0  44.0   
762            9       89             62              0        0  22.5   
763           10      101             76             48      180  32.9   
766            1      126             60              0        0  30.1   

     DiabetesPedigreeFunction  Age  Outcome  
0                       0.627   50        1  
1                  

In [17]:
# Filtering the data for patients with high BMI
high_bmi_patients = df[df['BMI'] > 30]
print(high_bmi_patients)


     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
4              0      137             40             35      168  43.1   
6              3       78             50             32       88  31.0   
7             10      115              0              0        0  35.3   
8              2      197             70             45      543  30.5   
..           ...      ...            ...            ...      ...   ...   
761            9      170             74             31        0  44.0   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  Outcome  
0                       0.627   50        1  
4                  
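
Filters can also be combined. As a rough sketch on a made-up `sample` DataFrame, wrap each condition in parentheses and join them with `&` (and) or `|` (or):

```python
import pandas as pd

# Hypothetical five-row DataFrame for illustration
sample = pd.DataFrame({
    "Age": [50, 31, 32, 21, 33],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Both conditions must hold
older_high_bmi = sample[(sample["Age"] > 30) & (sample["BMI"] > 30)]

# At least one condition must hold
young_or_high_bmi = sample[(sample["Age"] < 25) | (sample["BMI"] > 30)]

print(older_high_bmi)
print(young_or_high_bmi)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>` in Python.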

**rename( ):** When working with data, you may encounter situations where you need to change the names of columns or index labels to make them more meaningful or align with your analysis requirements. The rename() function provides a convenient way to modify these labels. (Syntax: df.rename(columns={'old_name': 'new_name'}))

In [46]:
# Rename the BloodPressure column to BP
# inplace=True modifies df directly and returns None, so the result is not assigned
df.rename(columns={'BloodPressure': 'BP'}, inplace=True)
print(df)

     Pregnancies  Glucose  BP  SkinThickness  Insulin   BMI  \
0              6      148  72             35        0  33.6   
1              1       85  66             29        0  26.6   
2              8      183  64              0        0  23.3   
3              1       89  66             23       94  28.1   
4              0      137  40             35      168  43.1   
..           ...      ...  ..            ...      ...   ...   
763           10      101  76             48      180  32.9   
764            2      122  70             27        0  36.8   
765            5      121  72             23      112  26.2   
766            1      126  60              0        0  30.1   
767            1       93  70             31        0  30.4   

     DiabetesPedigreeFunction  Age  Outcome  
0                       0.627   50        1  
1                       0.351   31        0  
2                       0.672   32        1  
3                       0.167   21        0  
4            
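
If you would rather keep the original DataFrame unchanged, leave out inplace=True: rename() then returns a renamed copy. A minimal sketch on a made-up `sample` DataFrame:

```python
import pandas as pd

# Hypothetical two-row DataFrame for illustration
sample = pd.DataFrame({"BloodPressure": [72, 66], "Age": [50, 31]})

# Without inplace=True, rename() returns a new DataFrame
# and leaves the original untouched
renamed = sample.rename(columns={"BloodPressure": "BP"})

print(list(sample.columns))
print(list(renamed.columns))
```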

**dropna():** When you're dealing with data, it's common to encounter missing values represented as NaN or None. These missing values can create issues when performing data analysis or processing. The dropna() function comes to the rescue by allowing you to get rid of rows or columns that have these missing values. In simpler terms, dropna() helps you clean up your data by removing any rows or columns that have missing values so that you can work with a more complete and reliable dataset.

In [52]:
# Drop rows with missing values
drop = df.dropna()
print(drop)


     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  Outcome  
0                       0.627   50        1  
1                  
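
To see dropna() actually remove something, here is a small sketch on a made-up `sample` DataFrame that contains a couple of NaN values. It drops rows, drops columns, and uses `subset` to check only one column.

```python
import pandas as pd

nan = float("nan")

# Hypothetical DataFrame with two missing values
sample = pd.DataFrame({
    "Glucose": [148, nan, 183],
    "BMI": [33.6, 26.6, nan],
    "Age": [50, 31, 32],
})

rows_dropped = sample.dropna()                    # drop rows with any missing value
cols_dropped = sample.dropna(axis=1)              # drop columns with any missing value
glucose_only = sample.dropna(subset=["Glucose"])  # only look at the Glucose column

print(rows_dropped)
print(cols_dropped)
print(glucose_only)
```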

**fillna( ):** The fillna() function in pandas helps you handle missing values by filling them with specified values or using various methods. It allows you to address the gaps in your data, making it more suitable for analysis and processing.

In [54]:
# Fill missing values with a specific value (here the placeholder string 'NA')
missing_values = df.fillna('NA')
print(missing_values)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  Outcome  
0                       0.627   50        1  
1                  
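
For numeric columns, a common alternative to a placeholder is to fill the gaps with the column's mean. This is only one possible choice, sketched on a made-up `sample` DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame with one missing Insulin value
sample = pd.DataFrame({
    "Insulin": [94.0, float("nan"), 168.0],
    "Age": [21, 33, 50],
})

# mean() skips NaN by default, so this is the mean of 94 and 168
insulin_mean = sample["Insulin"].mean()

# A dict tells fillna() which value to use for which column
filled = sample.fillna({"Insulin": insulin_mean})

print(filled)
```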

**mean( ):** When you have a collection of numerical data, you often want to understand the typical or average value. The mean() function comes in handy by providing a straightforward way to calculate this average. (Syntax: df['numbers'].mean())

In [8]:
# Calculate the average age of patients
average_age = df['Age'].mean()
print("Average Age:", average_age)

Average Age: 33.240885416666664


In [31]:
# Calculate the average age for patients with high BMI
average_age_high_bmi = high_bmi_patients['Age'].mean()
print("Average Age for High BMI Patients:", average_age_high_bmi)

Average Age for High BMI Patients: 33.79139784946236


**describe( ):** Applying describe() to a DataFrame or Series computes summary statistics for the numeric data: count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum.

In [14]:
# Calculate descriptive statistics of the dataset
statistics = df.describe()
print(statistics)

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedigreeFunction         Age     Outcome  
count  768.000000                768.000000  768.000000  768.000000  
mean    31.992578                  0.471876   33.240885    0.348958  
std      7.884160                  0.331329   11.760232    0.476951  
min      0.000000                  
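
describe() also works on a single column, and the result can be indexed by statistic name. A quick sketch on a made-up `sample` DataFrame:

```python
import pandas as pd

# Hypothetical five-row DataFrame for illustration
sample = pd.DataFrame({"Age": [21, 31, 32, 33, 50]})

# describe() on one column returns a Series indexed by statistic name
stats = sample["Age"].describe()

print(stats["mean"])  # average
print(stats["50%"])   # median
print(stats["max"])   # largest value
```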

**groupby( ):** When you have a lot of data, sometimes you want to organize and understand it better by grouping it into categories or groups. The groupby() function in pandas helps you do that by grouping the data based on the values in one or more columns. It allows you to bring similar data together so that you can analyze and summarize it more easily. (**Syntax: df.groupby('category')**).

Here groupby() is combined with apply() and a lambda: apply() runs a custom function on each group, and the lambda lets you write that function inline.

In [25]:
# Group the data by Age and Pregnancies, then sort each group by Insulin in descending order
grouped_data = df.groupby(['Age', 'Pregnancies']).apply(lambda x: x.sort_values('Insulin', ascending=False))
print(grouped_data)

                     Pregnancies  Glucose  BloodPressure  SkinThickness  \
Age Pregnancies                                                           
21  0           220            0      177             60             29   
                713            0      134             58             20   
                511            0      139             62             17   
                414            0      138             60             35   
                307            0      137             68             14   
...                          ...      ...            ...            ...   
69  5           123            5      132             80              0   
                684            5      136             82              0   
70  4           666            4      145             82             18   
72  2           453            2      119              0              0   
81  9           459            9      134             74             33   

                     Ins
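
A more typical use of groupby() is to summarize each group with an aggregate such as the mean. The sketch below does that on a made-up `sample` DataFrame, grouping by Outcome:

```python
import pandas as pd

# Hypothetical five-row DataFrame for illustration
sample = pd.DataFrame({
    "Outcome": [1, 0, 1, 0, 1],
    "Glucose": [148, 85, 183, 89, 137],
    "Age": [50, 31, 32, 21, 33],
})

# Mean of the selected columns within each Outcome group
mean_by_outcome = sample.groupby("Outcome")[["Glucose", "Age"]].mean()

# Several statistics for one column at once
summary = sample.groupby("Outcome")["Glucose"].agg(["mean", "max"])

print(mean_by_outcome)
print(summary)
```

Each distinct Outcome value becomes one row of the result, with the group label as the index.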

**to_csv()**: Using to_csv(), you can easily save your DataFrame as a CSV file, which can be opened and read by various software applications, including spreadsheet programs like Microsoft Excel. This function is valuable for data storage, sharing, and transferring data between different platforms or tools.  
The general syntax for to_csv() is df.to_csv('file_path.csv', sep=',', index=False), where df is the DataFrame you want to save, 'file_path.csv' is the desired file path and name, sep specifies the delimiter used to separate the values (default is a comma), and index indicates whether to include the row index in the output file (default is True).

In [32]:
# Write the results to a new CSV file
df.to_csv('diabetes_analysis.csv', index=False)
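
To check what to_csv() writes without touching the disk, you can send the output to an in-memory buffer and read it back. This is a small sketch with a made-up `sample` DataFrame, not part of the original analysis:

```python
import io
import pandas as pd

# Hypothetical two-row DataFrame for illustration
sample = pd.DataFrame({"Age": [50, 31], "BMI": [33.6, 26.6]})

# Write to an in-memory text buffer instead of a file
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm the round trip
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(sample))
```

With index=False the row index is left out, so the file contains only the column header and the data rows.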