•	Installation

•	Core components of pandas: Series and DataFrames (DF)

•	Reading in/Writing data to/from files

•	Cleaning data

## What is pandas?

1. Pandas is a data manipulation package in Python for tabular data. 
- Data in the form of rows and columns, also known as DataFrames. 
- Intuitively, you can think of a DataFrame as an Excel sheet. 

2. Pandas’ functionality
- Data transformations, like sorting rows and taking subsets
- Calculating summary statistics such as the mean
- Reshaping DataFrames, and joining DataFrames together
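
As a rough illustration of the operations listed above, here is a minimal sketch. The two small DataFrames (patients and ages) and their values are made up for this example and are not part of the diabetes dataset used later.

```python
import pandas as pd

# Hypothetical toy data, not the diabetes dataset used later.
patients = pd.DataFrame({"id": [3, 1, 2], "glucose": [183, 148, 85]})
ages = pd.DataFrame({"id": [1, 2, 3], "age": [50, 31, 32]})

sorted_patients = patients.sort_values("glucose")    # sort rows by a column
high_glucose = patients[patients["glucose"] > 100]   # take a subset of rows
mean_glucose = patients["glucose"].mean()            # summary statistic
joined = patients.merge(ages, on="id")               # join two DataFrames

print(sorted_patients)
print(high_glucose)
print(mean_glucose)
print(joined)
```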

In [44]:
# Constructing a DataFrame from a dictionary.
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df

Unnamed: 0,col1,col2
0,1,3
1,2,4


In [27]:
df['col2']

0    3
1    4
Name: col2, dtype: int64

In [32]:
df.head(1)

Unnamed: 0,col1,col2
0,1,3


In [28]:
df['col1'][1]

2

## What is pandas used for?
- Import datasets from databases, spreadsheets, comma-separated values (CSV) files, and more.
- Clean datasets, for example, by dealing with missing values.
- Tidy datasets by reshaping their structure into a suitable format for analysis.
- Aggregate data by calculating summary statistics such as the mean of columns, correlation between them, and more.
- Visualize datasets and uncover insights.

## Key benefits of the pandas package

1. Made for Python: Python is one of the most widely used languages for machine learning and data science.

2. Concise code: Operations written in pandas usually take fewer lines of code than the equivalent plain Python loops.

3. Intuitive view of data: pandas displays data as labeled rows and columns, which makes it easier to inspect and understand.

4. Extensive feature set: It supports a wide range of operations, including exploratory data analysis, handling missing values, calculating statistics, and visualizing univariate and bivariate data.

5. Works with large data: pandas works efficiently with datasets on the order of millions of records and hundreds of columns, provided the data fits in the machine's memory.

## Pandas Installation

In [1]:
# Install pandas
# pip install pandas
# Or you can also try below
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


## Importing data in pandas

- To begin working with pandas, import the pandas Python package. 
- When importing pandas, the most common alias for pandas is pd.

In [2]:
import pandas as pd

### 1. Importing CSV files

1. Use read_csv() with the path to a comma-separated values (CSV) file to read it.
2. The read operation below loads the CSV file diabetes.csv into a pandas DataFrame object, df. 
3. DataFrame.dtypes
- This returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns. Columns with mixed types are stored with the object dtype.

In [30]:
# Generate a pandas DataFrame object called df
df = pd.read_csv("diabetes.csv")

In [5]:
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


### 2. Importing text files

1. The sep (separator) argument is the character that separates the fields (columns) within each row of the file. 

2. Commonly used separators are the comma (sep=","), whitespace (sep=r"\s+"), tab (sep="\t"), and colon (sep=":").

In [9]:
df = pd.read_csv("diabetes.txt", sep=",")
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
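
To make the role of sep concrete, the sketch below reads the same small table from tab-separated and whitespace-separated text. The in-memory strings are made-up stand-ins for files on disk; io.StringIO is used only so the example runs without any external file.

```python
import io
import pandas as pd

# Hypothetical in-memory "files" standing in for files on disk.
tab_text = "Glucose\tAge\n148\t50\n85\t31\n"
space_text = "Glucose   Age\n148  50\n85    31\n"

df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")
# A regular-expression separator: one or more whitespace characters.
df_space = pd.read_csv(io.StringIO(space_text), sep=r"\s+")

print(df_tab)
print(df_tab.equals(df_space))
```

Both calls should produce the same two-column DataFrame; only the separator differs.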


### 3. Importing Excel files (single sheet)
1. Read Excel files (both XLS and XLSX) with the read_excel() function, passing the file path as input.

2. You can also use the header argument to specify which row becomes the DataFrame's header. 
- It has a default value of 0, which uses the first row as the headers (column names). 

In [62]:
df = pd.read_excel('diabetes.xlsx', header = 0)
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


### 4. Importing Excel files (multiple sheets)

To read a specific sheet, specify one additional argument, sheet_name, which accepts either a string with the sheet name or an integer with the sheet position. Python uses 0-indexing, so the first sheet is sheet_name=0, which is also the default when sheet_name is omitted.

In [36]:
df = pd.read_excel('diabetes_multi.xlsx', header = 0)
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [34]:
df = pd.read_excel('diabetes_multi.xlsx',sheet_name=1, header = 0)
df

Unnamed: 0,Test 1,Test 2,Test 3,Test 4,Test 5
0,87,79,91,82,94
1,72,79,81,74,88
2,94,92,81,89,96
3,77,56,67,81,79
4,79,82,85,81,90


In [41]:
df = pd.read_excel('diabetes_multi.xlsx',sheet_name='test_scores', header = 0)
df

Unnamed: 0,Test 1,Test 2,Test 3,Test 4,Test 5
0,87,79,91,82,94
1,72,79,81,74,88
2,94,92,81,89,96
3,77,56,67,81,79
4,79,82,85,81,90


In [42]:
df.dtypes

Test 1    int64
Test 2    int64
Test 3    int64
Test 4    int64
Test 5    int64
dtype: object

### 5. Importing JSON file
You can use read_json() for JSON files, with the JSON file name as the argument.

- JSON (JavaScript Object Notation) is a text-based data-interchange format that is easy for people to read and write and easy for machines to parse. It is based on a subset of the JavaScript programming language, but its conventions are familiar from many other languages, including Python.

- JSON is often used to store nested or semi-structured data that does not fit neatly into the fixed rows and columns of a SQL table.

- JSON is mainly built on two structures:

    - A collection of key/value pairs. In Python, this corresponds to a dictionary: each key is unique, whereas values can repeat.

    - An ordered list of values. The list can also contain other lists. In Python, this corresponds to a list, an ordered sequence of values such as strings, integers, etc.

In [65]:
df = pd.read_json("diabetes.json")
df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
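
The two JSON structures described above map directly onto a DataFrame. The sketch below builds a small JSON text from a list of key/value objects and reads it back. The records are made up for illustration, and io.StringIO stands in for a file such as diabetes.json.

```python
import io
import json
import pandas as pd

# Hypothetical records: a list (ordered values) of objects (key/value pairs).
records = [
    {"Glucose": 148, "Age": 50},
    {"Glucose": 85, "Age": 31},
]
json_text = json.dumps(records)

# Wrapping the text in StringIO lets read_json treat it like a file.
df_json = pd.read_json(io.StringIO(json_text))
print(df_json)
```

Each object becomes a row, and each key becomes a column.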


If you want to learn more about importing data with pandas, check out this cheat sheet on importing various file types with Python.
https://www.datacamp.com/cheat-sheet/importing-data-in-python-cheat-sheet

## Outputting data in pandas

- Just as pandas can import data from various file types, it also allows you to export data into various formats. 
- This is especially useful when data has been transformed using pandas and needs to be saved locally on your machine. 

### 1. Outputting a DataFrame into a CSV file

- A pandas DataFrame (here we are using df) is saved as a CSV file using the .to_csv() method. 
- The arguments include the filename with path and index – where index = True implies writing the DataFrame’s index.
    - The index=False argument specifies that the index column should not be included in the output file.
    - This is useful when the index is not meaningful or when it is already included as a separate column in the DataFrame.

In [63]:
# This code saves a pandas DataFrame df to a CSV file 
# named "diabetes_out.csv" in the current working directory.
df.to_csv("diabetes_out.csv", index=False)
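
To see what the index argument changes, the following sketch compares the text produced with index=True and index=False. It uses a made-up two-row DataFrame (df_small) and calls .to_csv() without a path, in which case pandas returns the CSV text instead of writing a file.

```python
import pandas as pd

# Hypothetical small DataFrame used only for this comparison.
df_small = pd.DataFrame({"Glucose": [148, 85], "Age": [50, 31]})

# With no path, to_csv returns the CSV text instead of writing a file.
with_index = df_small.to_csv(index=True)
without_index = df_small.to_csv(index=False)

print(with_index)      # first column holds the row index, with an empty header
print(without_index)   # only the data columns are written
```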

### 2. Outputting a DataFrame into a JSON file
Export DataFrame object into a JSON file by calling the .to_json() method.

In [64]:
df.to_json("diabetes_out.json")
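
The layout of the JSON output can be controlled with the orient argument. The sketch below, on a made-up DataFrame (df_small), compares the default column-oriented layout with orient="records", which writes one object per row. As with .to_csv(), omitting the path returns the text instead of writing a file.

```python
import json
import pandas as pd

# Hypothetical small DataFrame used only for this comparison.
df_small = pd.DataFrame({"Glucose": [148, 85], "Age": [50, 31]})

# With no path, to_json returns the JSON text instead of writing a file.
by_column = json.loads(df_small.to_json())                  # default layout
by_record = json.loads(df_small.to_json(orient="records"))  # one object per row

print(by_column)
print(by_record)
```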

### 3. Outputting a DataFrame into a text file

- As with writing DataFrames to CSV files, you can call .to_csv(). 
- The only differences are that the output file format is in .txt, and you need to specify a separator using the sep argument.

In [66]:
df.to_csv('diabetes_out.txt', header=df.columns, index=None, sep=' ')

- The header parameter is set to df.columns, which means that the column names of the DataFrame will be written as the first row of the output file. Passing header=True, the default, has the same effect.
- The index parameter is set to None, which means that the row index of the DataFrame will not be included in the output file. index=False is the more explicit way to write this.
- The sep parameter is set to a space character, which means that the values in each row will be separated by a space in the output file.

### 4. Outputting a DataFrame into an Excel file

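Call .to_excel() on the DataFrame object to save it as an “.xlsx” file. Note that pandas 2.0 and later no longer write the legacy “.xls” format, and writing Excel files requires an engine such as openpyxl to be installed.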

In [67]:
df.to_excel("diabetes_out.xlsx", index=False)

## Viewing and understanding DataFrames using pandas 

- Row: observation
- Column: variable
- Every value within a column shares one data type (dtype), such as integer, float, or text, but different columns can have different data types.

In [95]:
df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [96]:
type(df)

pandas.core.frame.DataFrame

### 1. How to view data using .head() and .tail()
- You can view the first few or last few rows of a DataFrame using the .head() or .tail() methods, respectively. 
- You can specify the number of rows through the n argument (the default value is 5).

In [78]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [79]:
df.head(7)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1


In [80]:
df.tail()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0


In [81]:
df.tail(n=3)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0


### 2. Understanding data using .describe()

The DataFrame.describe() method returns summary statistics for all numeric columns: count, mean, standard deviation, minimum, maximum, and quartiles.
- "count": the number of non-missing values in each column.
- "mean" and "std": the mean and standard deviation of the observations.
- "min" and "max": the smallest and largest values.
- "25%", "50%", "75%": the percentiles selected by the percentiles argument, whose values must fall between 0 and 1.
    - The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
    - The 50th percentile is the same as the median.


In [82]:
df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0
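
Two of the points above can be checked on a tiny example: "count" ignores missing values, and the "50%" row equals the median. The scores DataFrame below is made up for illustration.

```python
import pandas as pd

# Hypothetical column with one missing value.
scores = pd.DataFrame({"score": [10.0, 20.0, 30.0, 40.0, None]})
summary = scores.describe()

print(summary)
# "count" ignores the missing value, and the "50%" row is the median.
print(summary.loc["count", "score"])
print(summary.loc["50%", "score"], scores["score"].median())
```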


- You can also change which percentiles are reported using the percentiles argument. 
- Here, for example, we’re looking at the 30th, 50th, and 70th percentiles of the numeric columns in DataFrame df.

In [83]:
df.describe(percentiles=[0.3, 0.5, 0.7])

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
30%,1.0,102.0,64.0,8.2,0.0,28.2,0.259,25.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
70%,5.0,134.0,78.0,31.0,106.0,35.49,0.5637,38.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


- You can also isolate specific data types in your summary output by using the include argument. 
- Here, for example, we’re only summarizing the columns with the integer data type. 

In [84]:
df.describe(include=[int])

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,81.0,1.0


- Similarly, you might want to exclude certain data types using the exclude argument.

In [85]:
df.describe(exclude=[int])

Unnamed: 0,BMI,DiabetesPedigreeFunction
count,768.0,768.0
mean,31.992578,0.471876
std,7.88416,0.331329
min,0.0,0.078
25%,27.3,0.24375
50%,32.0,0.3725
75%,36.6,0.62625
max,67.1,2.42


- Practitioners often find such statistics easier to read after transposing them with the .T attribute.

In [86]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Pregnancies,768.0,3.845052,3.369578,0.0,1.0,3.0,6.0,17.0
Glucose,768.0,120.894531,31.972618,0.0,99.0,117.0,140.25,199.0
BloodPressure,768.0,69.105469,19.355807,0.0,62.0,72.0,80.0,122.0
SkinThickness,768.0,20.536458,15.952218,0.0,0.0,23.0,32.0,99.0
Insulin,768.0,79.799479,115.244002,0.0,0.0,30.5,127.25,846.0
BMI,768.0,31.992578,7.88416,0.0,27.3,32.0,36.6,67.1
DiabetesPedigreeFunction,768.0,0.471876,0.331329,0.078,0.24375,0.3725,0.62625,2.42
Age,768.0,33.240885,11.760232,21.0,24.0,29.0,41.0,81.0
Outcome,768.0,0.348958,0.476951,0.0,0.0,0.0,1.0,1.0


For more on describing DataFrames, check out the following cheat sheet:

https://www.datacamp.com/cheat-sheet/pandas-cheat-sheet-data-wrangling-in-python

### 3. Understanding data using .info( )

The .info() method is a quick way to look at the data types, missing values, and data size of a DataFrame. 
- Here, we’re setting the show_counts argument to True, which shows the number of non-missing values in each column. 
- We’re also setting memory_usage to True, which shows the total memory usage of the DataFrame elements. 
- When verbose is set to True, it prints the full summary from .info(). 

In [88]:
df.info(show_counts=True, memory_usage=True, verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 60.0 KB


In [89]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 60.0 KB


### 4. Understanding your data using .shape

- The number of rows and columns of a DataFrame can be read from its .shape attribute. 
- It returns a tuple (rows, columns), which can be indexed to get only the row count or only the column count.

In [91]:
df.shape # Get the number of rows and columns

(768, 9)

In [92]:
df.shape[0] # Get the number of rows only

768

In [93]:
df.shape[1] # Get the number of columns only

9

### 5. Get all columns and column names

- Calling the .columns attribute of a DataFrame object returns the column names in the form of an Index object. 
- As a reminder, a pandas index is the address/label of the row or column.

In [94]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [97]:
df.columns[0]

'Pregnancies'

- It can be converted to a list using the list() function.

In [98]:
list(df.columns)

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age',
 'Outcome']

In [100]:
list(df.columns)[0]

'Pregnancies'

- .index: the index attribute contains the row numbers or row names.

In [101]:
df.index

Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            758, 759, 760, 761, 762, 763, 764, 765, 766, 767],
           dtype='int64', length=768)

### 6. Checking for missing values in pandas with .isnull( )

The sample DataFrame does not have any missing values. 

Let's introduce a few to make things interesting. 
- The .copy() method makes a copy of the original DataFrame. This ensures that changes to the copy are not reflected in the original DataFrame. 
- Using .loc (to be discussed later), you can set the rows labeled 2 through 5 of the Pregnancies column to NaN values, which denote missing values.
    - NaN (Not a Number): NaN is the value pandas uses to represent missing or undefined numeric data. 
    - It is also the result of mathematical operations that have no defined value. NaN is a floating-point value, available in Python as float('nan'), which is why the Pregnancies column is displayed as floats (6.0, 1.0) once it contains NaN.

In [11]:
df2 = df.copy()
df2.loc[2:5,'Pregnancies'] = None
df2.head(7)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148,72,35,0,33.6,0.627,50,1
1,1.0,85,66,29,0,26.6,0.351,31,0
2,,183,64,0,0,23.3,0.672,32,1
3,,89,66,23,94,28.1,0.167,21,0
4,,137,40,35,168,43.1,2.288,33,1
5,,116,74,0,0,25.6,0.201,30,0
6,3.0,78,50,32,88,31.0,0.248,26,1
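
The behavior of NaN described above can be sketched in a few lines. The Series below is a made-up example built directly with a missing entry, rather than by assignment into an existing column.

```python
import math
import pandas as pd

nan = float("nan")
print(nan == nan)         # NaN is not equal to anything, including itself
print(math.isnan(nan))    # use isnan (or pd.isna) to test for it instead

# An integer-looking column that contains a missing value is stored as float64.
with_missing = pd.Series([6, 1, None])
print(with_missing.dtype)
print(pd.isna(with_missing).tolist())
```

Because NaN never compares equal to itself, missing values should be detected with functions such as pd.isna() rather than with ==.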


- You can check whether each element in a DataFrame is missing using the .isnull( ) method.

In [105]:
df2.isnull().head(7)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,True,False,False,False,False,False,False,False,False
3,True,False,False,False,False,False,False,False,False
4,True,False,False,False,False,False,False,False,False
5,True,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False


- Given it's often more useful to know how much missing data you have, you can combine .isnull() with .sum() to count the number of nulls in each column.

In [106]:
df2.isnull().sum()

Pregnancies                 4
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

- You can also do a double sum to get the total number of nulls in the DataFrame.

In [110]:
df2.isnull().sum().sum()

4
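
The same pattern extends to other summaries of missingness. In the sketch below, toy is a made-up DataFrame. It uses .isna(), which performs the same check as .isnull(), together with .mean() and .any() to get the share of missing values per column and the number of rows that contain any missing value.

```python
import pandas as pd

# Hypothetical frame with two missing entries in column "a".
toy = pd.DataFrame({"a": [1.0, None, None, 4.0], "b": [1, 2, 3, 4]})

missing_per_column = toy.isna().sum()             # count of NaNs per column
missing_share = toy.isna().mean()                 # fraction of rows missing per column
rows_with_missing = toy.isna().any(axis=1).sum()  # rows with at least one NaN

print(missing_per_column)
print(missing_share)
print(rows_with_missing)
```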

## Slicing and Extracting Data in pandas

The pandas package offers several ways to subset, filter, and isolate data in your DataFrames.

### 1. Isolating one column using [ ]

- You can isolate a single column using square brackets [ ] with a column name inside. 
- The output is a pandas Series object. 
- A pandas Series is a one-dimensional labeled array that can hold data of any type, including integer, float, string, boolean, Python objects, etc. 
    - Array: a special variable that can hold more than one value at a time. You can access the values by referring to an index number.
    - Main difference between a list and an array: https://www.geeksforgeeks.org/difference-between-list-and-array-in-python/
        - A list can consist of elements belonging to different data types
        - An array only consists of elements belonging to the same data type
    - Main difference between a list and a Series: https://www.geeksforgeeks.org/creating-a-pandas-series-from-lists/?ref=header_search
        - A list can consist of elements belonging to different data types
        - A Series has a single dtype for all of its values (mixed values are stored with the object dtype)
- A DataFrame is composed of many Series that act as columns.


In [161]:
df['Outcome']

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64
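
A short sketch of these Series properties, using a made-up DataFrame (toy) in place of the diabetes data:

```python
import pandas as pd

# Hypothetical stand-in for the diabetes DataFrame.
toy = pd.DataFrame({"Glucose": [148, 85, 183], "BMI": [33.6, 26.6, 23.3]})

glucose = toy["Glucose"]
print(type(glucose).__name__)   # a single column is a Series
print(glucose.dtype)            # one dtype for the whole Series
print(glucose.index.tolist())   # row labels shared with the parent DataFrame
print(glucose[1])               # look up a value by its index label
```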

### 2. Isolating two or more columns using [[ ]] 
- You can also provide a list of column names inside the square brackets to fetch more than one column. 
- Here, square brackets are used in two different ways. 
    - We use the outer square brackets to indicate a subset of a DataFrame, and the inner square brackets to create a list.

In [7]:
df[['Pregnancies', 'Outcome']]

Unnamed: 0,Pregnancies,Outcome
0,6,1
1,1,0
2,8,1
3,1,0
4,0,1
...,...,...
763,10,0
764,2,0
765,5,0
766,1,1


### 3. Isolating one row using [ ] 
- A single row can be fetched by passing in a boolean series with one True value. 
- In the example below, the second row with index = 1 is returned. 
    - Here, .index returns the row labels of the DataFrame, and the comparison turns that into a Boolean one-dimensional array.


In [9]:
df[df.index==1]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
1,1,85,66,29,0,26.6,0.351,31,0


### 4. Isolating two or more rows using [ ] 
Similarly, two or more rows can be returned using the .isin() method instead of a == operator.

In [195]:
df[df.index.isin(range(2,10))]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1
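
Both row-selection techniques above rely on a Boolean mask with one True/False entry per row. The sketch below prints the masks themselves on a made-up four-row DataFrame (toy), so you can see what is passed inside the square brackets.

```python
import pandas as pd

# Hypothetical four-row DataFrame with the default 0..3 index.
toy = pd.DataFrame({"Glucose": [148, 85, 183, 89]})

single_mask = toy.index == 1         # array of True/False, one per row
multi_mask = toy.index.isin([1, 3])  # True where the label is in the list

print(single_mask)
print(toy[single_mask])
print(toy[multi_mask])
```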


### 5. Using .loc[] and .iloc[] to fetch rows
You can fetch specific rows by label or by position using .loc[] and .iloc[] ("location" and "integer location"). 
- .loc[] uses a label (or a Boolean condition) to point to a row, column or cell
- .iloc[] uses the numeric position (starting from 0 and going up by one for each row).

In [171]:
df2.index = range(1,769)
df2

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
1,6.0,148,72,35,0,33.6,0.627,50,1
2,1.0,85,66,29,0,26.6,0.351,31,0
3,,183,64,0,0,23.3,0.672,32,1
4,,89,66,23,94,28.1,0.167,21,0
5,,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
764,10.0,101,76,48,180,32.9,0.171,63,0
765,2.0,122,70,27,0,36.8,0.340,27,0
766,5.0,121,72,23,112,26.2,0.245,30,0
767,1.0,126,60,0,0,30.1,0.349,47,1


In [172]:
# The below example returns a pandas Series instead of a DataFrame. 
# The 1 represents the row index (label)
df2.loc[1]

Pregnancies                   6.000
Glucose                     148.000
BloodPressure                72.000
SkinThickness                35.000
Insulin                       0.000
BMI                          33.600
DiabetesPedigreeFunction      0.627
Age                          50.000
Outcome                       1.000
Name: 1, dtype: float64

In [173]:
# the 1 in .iloc[] is the row position (second row).
df2.iloc[1]

Pregnancies                  1.000
Glucose                     85.000
BloodPressure               66.000
SkinThickness               29.000
Insulin                      0.000
BMI                         26.600
DiabetesPedigreeFunction     0.351
Age                         31.000
Outcome                      0.000
Name: 2, dtype: float64

#### You can also fetch multiple rows by providing a range in square brackets.

In [176]:
df2.loc[100:110]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
100,1.0,122,90,51,220,49.7,0.325,31,1
101,1.0,163,72,0,0,39.0,1.222,33,1
102,1.0,151,60,0,0,26.1,0.179,22,0
103,0.0,125,96,0,0,22.5,0.262,21,0
104,1.0,81,72,18,40,26.6,0.283,24,0
105,2.0,85,65,0,0,39.6,0.93,27,0
106,1.0,126,56,29,152,28.7,0.801,21,0
107,1.0,96,122,0,0,22.4,0.207,27,0
108,4.0,144,58,28,140,29.5,0.287,37,0
109,3.0,83,58,31,18,34.3,0.336,25,0
110,0.0,95,85,25,36,37.4,0.247,24,1


In [177]:
df2.iloc[100:110]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
101,1.0,163,72,0,0,39.0,1.222,33,1
102,1.0,151,60,0,0,26.1,0.179,22,0
103,0.0,125,96,0,0,22.5,0.262,21,0
104,1.0,81,72,18,40,26.6,0.283,24,0
105,2.0,85,65,0,0,39.6,0.93,27,0
106,1.0,126,56,29,152,28.7,0.801,21,0
107,1.0,96,122,0,0,22.4,0.207,27,0
108,4.0,144,58,28,140,29.5,0.287,37,0
109,3.0,83,58,31,18,34.3,0.336,25,0
110,0.0,95,85,25,36,37.4,0.247,24,1
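
One detail worth keeping in mind when slicing: a label slice with .loc[] includes both endpoints, while a position slice with .iloc[] excludes the stop position, like ordinary Python slicing. The sketch below shows this on a made-up five-row DataFrame (toy) whose labels start at 1, similar to df2 after re-indexing.

```python
import pandas as pd

# Hypothetical frame whose labels start at 1, like df2 after re-indexing.
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=range(1, 6))

by_label = toy.loc[2:4]       # labels 2, 3 and 4: the end label is included
by_position = toy.iloc[2:4]   # positions 2 and 3: the end position is excluded

print(by_label)
print(by_position)
```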


#### You can also subset with .loc[] and .iloc[] by using a list instead of a range.

In [178]:
df2.loc[[100, 200, 300]]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
100,1.0,122,90,51,220,49.7,0.325,31,1
200,4.0,148,60,27,318,30.9,0.15,29,1
300,8.0,112,72,0,0,23.6,0.84,58,0


In [179]:
df2.iloc[[100, 200, 300]]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
101,1.0,163,72,0,0,39.0,1.222,33,1
201,0.0,113,80,16,0,31.0,0.874,21,0
301,0.0,167,0,0,0,32.3,0.839,30,1


#### You can also select specific columns along with rows. 
This is where .iloc[] differs from .loc[]: it requires column positions, not column labels.

In [180]:
df2.loc[100:110, ['Pregnancies', 'Glucose', 'BloodPressure']]

Unnamed: 0,Pregnancies,Glucose,BloodPressure
100,1.0,122,90
101,1.0,163,72
102,1.0,151,60
103,0.0,125,96
104,1.0,81,72
105,2.0,85,65
106,1.0,126,56
107,1.0,96,122
108,4.0,144,58
109,3.0,83,58
110,0.0,95,85


In [199]:
df2.iloc[100:110, :3]

Unnamed: 0,Pregnancies,Glucose,BloodPressure
101,1.0,163,72
102,1.0,151,60
103,0.0,125,96
104,1.0,81,72
105,2.0,85,65
106,1.0,126,56
107,1.0,96,122
108,4.0,144,58
109,3.0,83,58
110,0.0,95,85
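
If you know the column labels but want to use .iloc[], one possible approach is to translate labels into positions with the column Index methods get_loc() and get_indexer(). The sketch below uses a made-up two-row DataFrame (toy).

```python
import pandas as pd

# Hypothetical stand-in for the diabetes DataFrame.
toy = pd.DataFrame(
    {"Pregnancies": [6, 1], "Glucose": [148, 85], "BloodPressure": [72, 66]}
)

# Translate column labels into integer positions for use with .iloc[].
glucose_pos = toy.columns.get_loc("Glucose")
positions = toy.columns.get_indexer(["Pregnancies", "BloodPressure"])

print(glucose_pos)
print(positions)
print(toy.iloc[:, positions])
```

The last line selects the same columns as toy.loc[:, ["Pregnancies", "BloodPressure"]].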


#### You can also give only the starting row of a slice to select everything from that row to the end.

In [184]:
df2.loc[760:, ['Pregnancies', 'Glucose', 'BloodPressure']]

Unnamed: 0,Pregnancies,Glucose,BloodPressure
760,6.0,190,92
761,2.0,88,58
762,9.0,170,74
763,9.0,89,62
764,10.0,101,76
765,2.0,122,70
766,5.0,121,72
767,1.0,126,60
768,1.0,93,70


In [185]:
df2.iloc[760:, :3]

Unnamed: 0,Pregnancies,Glucose,BloodPressure
761,2.0,88,58
762,9.0,170,74
763,9.0,89,62
764,10.0,101,76
765,2.0,122,70
766,5.0,121,72
767,1.0,126,60
768,1.0,93,70


#### You can update/modify certain values by using the assignment operator =

In [12]:
df2.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148,72,35,0,33.6,0.627,50,1
1,1.0,85,66,29,0,26.6,0.351,31,0
2,,183,64,0,0,23.3,0.672,32,1
3,,89,66,23,94,28.1,0.167,21,0
4,,137,40,35,168,43.1,2.288,33,1


In [13]:
df2.loc[df2['Age'] == 50, 'Age'] = 51
df2

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148,72,35,0,33.6,0.627,51,1
1,1.0,85,66,29,0,26.6,0.351,31,0
2,,183,64,0,0,23.3,0.672,32,1
3,,89,66,23,94,28.1,0.167,21,0
4,,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10.0,101,76,48,180,32.9,0.171,63,0
764,2.0,122,70,27,0,36.8,0.340,27,0
765,5.0,121,72,23,112,26.2,0.245,30,0
766,1.0,126,60,0,0,30.1,0.349,47,1


### Conditional slicing (that fits certain conditions)

pandas lets you filter data by conditions over row/column values. 

#### Isolating rows based on a condition in pandas 
For example, the below code selects the row where Blood Pressure is exactly 122. 
- Here, we are isolating rows using the brackets [ ] as seen in previous sections. 
- However, instead of inputting row indices or column names, we are inputting a condition where the column BloodPressure is equal to 122. 
- We denote this condition using df.BloodPressure == 122.

In [208]:
df[df.BloodPressure == 122]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
106,1,96,122,0,0,22.4,0.207,27,0


Q: Can you fetch all rows where Outcome is 1?

- Here df.Outcome selects that column, 
- df.Outcome == 1 returns a Series of Boolean values determining which Outcomes are equal to 1, 
- then [] takes a subset of df where that Boolean Series is True.
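
The three steps above can be sketched on a made-up three-row DataFrame (toy) that stands in for the diabetes data:

```python
import pandas as pd

# Hypothetical stand-in for the diabetes DataFrame.
toy = pd.DataFrame({"Glucose": [148, 85, 183], "Outcome": [1, 0, 1]})

mask = toy["Outcome"] == 1   # Boolean Series, one entry per row
positives = toy[mask]        # keep only the rows where the mask is True

print(mask.tolist())
print(positives)
```

The same two lines applied to df instead of toy would answer the question for the full dataset.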

#### Isolating rows and columns based on a condition in pandas 
You can use a > operator to draw comparisons. 

The below code fetches Pregnancies, Glucose, and BloodPressure for all records with BloodPressure greater than 100.

In [210]:
df.loc[df['BloodPressure'] > 100, ['Pregnancies', 'Glucose', 'BloodPressure']]

Unnamed: 0,Pregnancies,Glucose,BloodPressure
43,9,171,110
84,5,137,108
106,1,96,122
177,0,129,110
207,5,162,104
362,5,103,108
369,1,133,102
440,0,189,104
549,4,189,110
658,11,127,106


## Cleaning data using pandas 
Data cleaning is one of the most common tasks in data science. 
- pandas lets you preprocess data for any use, including but not limited to training machine learning and deep learning models. 
    - Let’s use the DataFrame df2 from earlier, having four missing values, to illustrate a few data cleaning use cases. 
    - As a reminder, here's how you can see how many missing values are in a DataFrame.

In [211]:
df2.isnull().sum()

Pregnancies                 4
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [214]:
df2.shape

(768, 9)

### Dealing with missing data technique #1: Dropping missing values

One way to deal with missing data is to drop it. 
- This is particularly useful in cases where you have plenty of data and losing a small portion won’t impact the downstream analysis. 
- You can use a .dropna() method as shown below. 
- Here, we are saving the results from .dropna() into a DataFrame df3.

In [14]:
df3 = df2.copy()
df3 = df3.dropna()
df3.shape  # 4 rows fewer than df2

(764, 9)

#### The axis argument lets you specify whether you are dropping rows, or columns, with missing values. 

- The default axis removes the rows containing NaNs. 
    - Use axis = 0 to remove the rows with one or more NaN values. 
    - Use axis = 1 to remove the columns with one or more NaN values. 
- Also, notice how we are using the argument inplace=True which lets you skip saving the output of .dropna() into a new DataFrame.  

In [15]:
df3 = df2.copy()
df3.dropna(inplace=True, axis=1)
df3.head()

Unnamed: 0,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,148,72,35,0,33.6,0.627,51,1
1,85,66,29,0,26.6,0.351,31,0
2,183,64,0,0,23.3,0.672,32,1
3,89,66,23,94,28.1,0.167,21,0
4,137,40,35,168,43.1,2.288,33,1


Q: Can you remove the rows with one or more NaN values from df3?
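
One possible way to approach this question, sketched on a tiny made-up DataFrame rather than on df2 itself (the data and the names `toy`, `rows_dropped`, and `cols_dropped` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a single missing value.
toy = pd.DataFrame({
    "Pregnancies": [6.0, np.nan, 3.0],
    "Glucose": [148, 85, 183],
})

# axis=0 (the default) drops every row that contains at least one NaN.
rows_dropped = toy.dropna(axis=0)

# axis=1 drops every column that contains at least one NaN.
cols_dropped = toy.dropna(axis=1)

print(rows_dropped.shape)
print(cols_dropped.columns.tolist())
```

Without inplace=True, .dropna() returns a new DataFrame and leaves `toy` unchanged.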

### Dealing with missing data technique #2: Replacing missing values

Instead of dropping missing values, replacing them with a summary statistic or a specific value (depending on the use case) may be the best way to go. 
- For example, if a column of daily temperatures for one week is missing a single value, replacing that value with the week's average temperature may be more effective than dropping the row entirely. 

In [17]:
df3 = df2.copy()
# Get the mean of Pregnancies (NaN values are skipped by default)
mean_value = df3['Pregnancies'].mean()
# Fill missing values in the Pregnancies column only, using .fillna()
df3['Pregnancies'] = df3['Pregnancies'].fillna(mean_value)
df3.head(7)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148,72,35,0,33.6,0.627,51,1
1,1.0,85,66,29,0,26.6,0.351,31,0
2,3.846859,183,64,0,0,23.3,0.672,32,1
3,3.846859,89,66,23,94,28.1,0.167,21,0
4,3.846859,137,40,35,168,43.1,2.288,33,1
5,3.846859,116,74,0,0,25.6,0.201,30,0
6,3.0,78,50,32,88,31.0,0.248,26,1
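
The temperature example mentioned earlier can be sketched as follows. The values and the names `week`, `mean_temp`, and `filled` are made up for illustration. Passing a dictionary to .fillna() restricts the replacement to the named column.

```python
import numpy as np
import pandas as pd

# Hypothetical week of temperatures with Wednesday missing.
week = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    "temperature": [20.0, 22.0, np.nan, 24.0, 21.0, 23.0, 22.0],
})

# .mean() ignores NaN by default, so this averages the six known values.
mean_temp = week["temperature"].mean()

# Fill only the temperature column; other columns are left untouched.
filled = week.fillna({"temperature": mean_temp})

print(mean_temp)
print(filled)
```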


### Dealing with duplicate data
Let's add some duplicates to the original data to learn how to eliminate duplicates in a DataFrame. 
- Here, we are using the pd.concat() function to stack the rows of df2 on top of df2 itself, which adds an exact duplicate of every row in df2. 

In [23]:
df3 = pd.concat([df2, df2])
df3.shape

(1536, 9)

In [19]:
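# Without parentheses, .info returns the bound method (shown below); call df3.info() for the summary.
df3.info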

<bound method DataFrame.info of      Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6.0      148             72             35        0  33.6   
1            1.0       85             66             29        0  26.6   
2            NaN      183             64              0        0  23.3   
3            NaN       89             66             23       94  28.1   
4            NaN      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763         10.0      101             76             48      180  32.9   
764          2.0      122             70             27        0  36.8   
765          5.0      121             72             23      112  26.2   
766          1.0      126             60              0        0  30.1   
767          1.0       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  Outcome  
0                       0.627   5

#### You can remove duplicate rows from the DataFrame using the .drop_duplicates() method, which by default keeps the first occurrence of each row.

In [20]:
df3 = df3.drop_duplicates()
df3.shape

(768, 9)
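
The same duplicate-then-deduplicate round trip can be sketched on a two-row, made-up DataFrame (the names `toy`, `doubled`, and `deduped` are illustrative assumptions). The .duplicated() method shows which rows .drop_duplicates() would remove.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"Glucose": [148, 85], "Outcome": [1, 0]})

# Stacking the frame on itself duplicates every row; index labels repeat (0, 1, 0, 1).
doubled = pd.concat([toy, toy])

# True marks rows that repeat an earlier row.
flags = doubled.duplicated()

# Keeps the first occurrence of each distinct row.
deduped = doubled.drop_duplicates()

print(doubled.shape, deduped.shape)
print(flags.tolist())
```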

In [21]:
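# Without parentheses, .info returns the bound method (shown below); call df3.info() for the summary.
df3.info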

<bound method DataFrame.info of      Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6.0      148             72             35        0  33.6   
1            1.0       85             66             29        0  26.6   
2            NaN      183             64              0        0  23.3   
3            NaN       89             66             23       94  28.1   
4            NaN      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763         10.0      101             76             48      180  32.9   
764          2.0      122             70             27        0  36.8   
765          5.0      121             72             23      112  26.2   
766          1.0      126             60              0        0  30.1   
767          1.0       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  Outcome  
0                       0.627   5

### Renaming columns
A common data cleaning task is renaming columns. 
- With the .rename() method, you can pass a dictionary that maps old names to new names to the columns argument, which renames only the columns you list.

In [24]:
# The dictionary for mapping old and new column names.
df3.rename(columns = {'DiabetesPedigreeFunction':'DPF'}, inplace = True)
df3.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DPF,Age,Outcome
0,6.0,148,72,35,0,33.6,0.627,51,1
1,1.0,85,66,29,0,26.6,0.351,31,0
2,,183,64,0,0,23.3,0.672,32,1
3,,89,66,23,94,28.1,0.167,21,0
4,,137,40,35,168,43.1,2.288,33,1


#### You can also directly assign column names as a list to the DataFrame.

In [272]:
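# Names are assigned by position (one per column). Note: this list shifts the labels,
# so the first column (formerly Pregnancies) is now named Glucose and Outcome becomes STF.
df3.columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DPF', 'Age', 'Outcome', 'STF']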
df3.head()

Unnamed: 0,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DPF,Age,Outcome,STF
0,6.0,148,72,35,0,33.6,0.627,51,1
1,1.0,85,66,29,0,26.6,0.351,31,0
2,,183,64,0,0,23.3,0.672,32,1
3,,89,66,23,94,28.1,0.167,21,0
4,,137,40,35,168,43.1,2.288,33,1
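
To compare the two renaming approaches side by side, here is a minimal sketch on made-up data. The names `toy`, `renamed`, and `relabeled`, and the new label `Preg`, are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the diabetes dataset.
toy = pd.DataFrame({
    "Pregnancies": [6, 1],
    "DiabetesPedigreeFunction": [0.627, 0.351],
})

# .rename() targets columns by name; unlisted columns keep their names.
renamed = toy.rename(columns={"DiabetesPedigreeFunction": "DPF"})

# Assigning to .columns replaces every name by position,
# so the list must have exactly one entry per column, in order.
relabeled = toy.copy()
relabeled.columns = ["Preg", "DPF"]

print(renamed.columns.tolist())
print(relabeled.columns.tolist())
```

Because .rename() matches by name, it is usually the safer choice when only a few columns change.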


For more on data cleaning, and for easier, more predictable data cleaning workflows, check out the following checklist, which provides you with a comprehensive set of common data cleaning tasks:
https://www.datacamp.com/blog/infographic-data-cleaning-checklist

## Data analysis in pandas
The main value proposition of pandas lies in its quick data analysis functionality. In this section, we'll focus on a set of analysis techniques you can use in pandas.

### Summary operators (mean, mode, median)
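- You can get the mean of each column using the .mean() method.
- The mode of each column can be computed similarly using the .mode() method. Because a column can have several equally frequent values, .mode() returns a DataFrame, padding shorter columns with NaN.
- The median of each column is computed with the .median() method.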


In [26]:
# Printing the mean of columns in pandas
df.mean()

Pregnancies                   3.845052
Glucose                     120.894531
BloodPressure                69.105469
SkinThickness                20.536458
Insulin                      79.799479
BMI                          31.992578
DiabetesPedigreeFunction      0.471876
Age                          33.240885
Outcome                       0.348958
dtype: float64

In [32]:
# Printing the mode of columns in pandas
df.mode()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,1.0,99,70.0,0.0,0.0,32.0,0.254,22.0,0.0
1,,100,,,,,0.258,,


In [34]:
# Printing the median of columns in pandas
df.median()

Pregnancies                   3.0000
Glucose                     117.0000
BloodPressure                72.0000
SkinThickness                23.0000
Insulin                      30.5000
BMI                          32.0000
DiabetesPedigreeFunction      0.3725
Age                          29.0000
Outcome                       0.0000
dtype: float64
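
The following sketch runs the three summary methods on a tiny made-up DataFrame (the data and the names `toy`, `means`, `medians`, and `modes` are assumptions). The Glucose column has a tie on purpose, which illustrates why .mode() can return more than one row.

```python
import pandas as pd

# Hypothetical toy data; Glucose has two equally frequent values.
toy = pd.DataFrame({
    "Glucose": [99, 99, 100, 100],
    "Age": [21, 21, 30, 48],
})

means = toy.mean()      # one value per column
medians = toy.median()  # one value per column
modes = toy.mode()      # one row per tied mode; shorter columns padded with NaN

print(means)
print(medians)
print(modes)
```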

### Create new columns based on existing columns 

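pandas lets you combine two or more columns with ordinary arithmetic operators, as if they were scalar variables. The operation is applied element-wise (row by row) and is fast because it is vectorized. 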

In [35]:
# Divides each value in the Glucose column by the corresponding value in the Insulin column
# to compute a new column named Glucose_Insulin_Ratio. Rows where Insulin is 0 yield inf.

df2['Glucose_Insulin_Ratio'] = df2['Glucose']/df2['Insulin']
df2.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,Glucose_Insulin_Ratio
0,6.0,148,72,35,0,33.6,0.627,51,1,inf
1,1.0,85,66,29,0,26.6,0.351,31,0,inf
2,,183,64,0,0,23.3,0.672,32,1,inf
3,,89,66,23,94,28.1,0.167,21,0,0.946809
4,,137,40,35,168,43.1,2.288,33,1,0.815476
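
The inf values above come from dividing by an Insulin value of 0. One possible way to handle them, shown here as a sketch on made-up data, is to convert them to NaN so they can be treated like any other missing value. Whether that is appropriate depends on the analysis; the name `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the first row has Insulin equal to 0.
toy = pd.DataFrame({"Glucose": [148, 89, 137], "Insulin": [0, 94, 168]})

# Element-wise division; 148 / 0 produces inf rather than raising an error.
ratio = toy["Glucose"] / toy["Insulin"]

# Replace positive and negative infinity with NaN.
toy["Glucose_Insulin_Ratio"] = ratio.replace([np.inf, -np.inf], np.nan)

print(toy)
```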


### Counting using .value_counts()

Often you'll work with categorical values and want to count the number of observations in each category of a column. 
- Category values can be counted using the .value_counts() method. 
    - Here, for example, we count the observations where Outcome is diabetic (1) and where Outcome is non-diabetic (0).
    - Passing normalize=True returns proportions instead of absolute counts.
    - Results are sorted by count in descending order by default (sort=True); pass sort=False to turn sorting off.
    - You can also call .value_counts() on a DataFrame, using the subset argument to count the unique combinations of values across several columns. 

In [39]:
# Using .value_counts() in pandas
df['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [38]:
# Using .value_counts() in pandas with normalization
df['Outcome'].value_counts(normalize=True)

0    0.651042
1    0.348958
Name: Outcome, dtype: float64

In [40]:
# Using .value_counts() in pandas with sorting turned off
df['Outcome'].value_counts(sort=False)

1    268
0    500
Name: Outcome, dtype: int64

In [42]:
# Here, for example, we are applying value_counts() on df with the subset argument, 
# which takes in a list of columns. 
df.value_counts(subset=['Pregnancies', 'Outcome'])

Pregnancies  Outcome
1            0          106
2            0           84
0            0           73
3            0           48
4            0           45
0            1           38
5            0           36
6            0           34
1            1           29
3            1           27
7            1           25
4            1           23
8            1           22
5            1           21
7            0           20
2            1           19
9            1           18
6            1           16
8            0           16
10           0           14
             1           10
9            0           10
11           1            7
12           0            5
13           0            5
             1            5
11           0            4
12           1            4
14           1            2
15           1            1
17           1            1
dtype: int64
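
For a result that is small enough to check by hand, here is a sketch of .value_counts() on a made-up Series (the values and the names `outcomes`, `counts`, and `shares` are assumptions):

```python
import pandas as pd

# Hypothetical outcomes: three 0s and two 1s.
outcomes = pd.Series([0, 1, 0, 0, 1], name="Outcome")

# Absolute counts, most frequent value first.
counts = outcomes.value_counts()

# Proportions that sum to 1.
shares = outcomes.value_counts(normalize=True)

print(counts)
print(shares)
```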

### Aggregating data with .groupby() in pandas

- pandas lets you aggregate values by grouping them by specific column values. 
- You can do that by combining the .groupby() method with a summary method of your choice.
- .groupby() enables grouping by more than one column by passing a list of column names.
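- Common summary methods can be used alongside .groupby(), including .min(), .max(), .mean(), .median(), .sum(), and more. Statistics without a built-in group method (such as the mode) can be applied through .agg() or .apply().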

In [44]:
# the mean of each of the numeric columns grouped by Outcome.
df.groupby('Outcome').mean()

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


In [45]:
df.groupby(['Pregnancies', 'Outcome']).mean()

Pregnancies,Outcome,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,0,111.945205,69.205479,21.054795,77.561644,31.727397,0.457055,27.09589
0,1,144.236842,63.210526,24.605263,89.578947,39.213158,0.643368,28.578947
1,0,104.254717,66.830189,23.04717,84.320755,29.616038,0.451679,25.254717
1,1,143.793103,71.310345,29.517241,151.137931,37.793103,0.613759,35.103448
2,0,105.214286,61.940476,20.107143,72.619048,29.679762,0.479881,25.892857
2,1,135.473684,69.052632,28.210526,144.315789,34.578947,0.543737,32.947368
3,0,109.604167,65.708333,17.520833,62.020833,29.23125,0.358354,28.770833
3,1,148.444444,68.148148,24.62963,132.666667,32.548148,0.563333,29.481481
4,0,117.555556,71.577778,18.422222,78.466667,31.255556,0.410511,30.066667
4,1,139.913043,67.0,10.913043,51.782609,33.873913,0.516478,38.086957


In [47]:
# the maximum of each of the columns grouped by Outcome.
df.groupby('Outcome').max()

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,13,197,122,60,744,57.3,2.329,81
1,17,199,114,99,846,67.1,2.42,70
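
When different columns need different summaries, .groupby() can be combined with .agg(). The sketch below uses made-up data; the names `toy`, `summary`, `mean_glucose`, and `max_age` are hypothetical. Each keyword argument names an output column and gives the source column and the function to apply.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the diabetes dataset.
toy = pd.DataFrame({
    "Outcome": [0, 0, 1, 1],
    "Glucose": [85, 89, 148, 183],
    "Age": [31, 21, 50, 32],
})

# One row per Outcome value, with a different summary for each column.
summary = toy.groupby("Outcome").agg(
    mean_glucose=("Glucose", "mean"),
    max_age=("Age", "max"),
)

print(summary)
```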


References:

https://www.datacamp.com/tutorial/pandas

https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

https://www.datacamp.com/tutorial/importing-data-into-pandas

https://app.datacamp.com/learn/courses/data-manipulation-with-pandas

https://www.turing.com/kb/nan-values-in-python

https://www.geeksforgeeks.org/difference-between-list-and-array-in-python/

https://www.geeksforgeeks.org/creating-a-pandas-series-from-lists/?ref=header_search