In [10]:
# import libraries
import numpy as np
import matplotlib.pyplot as plt

- Pandas is a Python library used for analyzing, cleaning, exploring, and manipulating data.
- The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".

`Data Science`: a branch of computer science that studies how to store, use, and analyze data to derive information from it.

In [None]:
!pip install pandas

In [3]:
import pandas
print(pandas.__version__)

2.2.3


In [4]:
mydataset = {
    'cars' : ['BMW', 'Volvo', 'Ford'],
    'passings' : [3, 7, 2]
}

myvar = pandas.DataFrame(mydataset)
myvar

Unnamed: 0,cars,passings
0,BMW,3
1,Volvo,7
2,Ford,2


In [5]:
import pandas as pd

In [6]:
mydataset = {
    'cars' : ['BMW', 'Volvo', 'Ford'],
    'passings' : [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)
myvar

Unnamed: 0,cars,passings
0,BMW,3
1,Volvo,7
2,Ford,2


#### Pandas Series

- A Pandas Series is like a column in a table.
- It is a one-dimensional array holding data of any type.

In [8]:
a = [10, 20, 30, 40, 50]
myvar = pd.Series(a)
myvar

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [13]:
arr = np.arange(10,51,5)
s = pd.Series(arr)
s

0    10
1    15
2    20
3    25
4    30
5    35
6    40
7    45
8    50
dtype: int64

Labels

- By default, values are labeled according to their index number, starting at 0 for the first value, 1 for the second, and so on.
- These index labels allow you to access specific values in the dataset.

In [15]:
print(s[5])

35
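
With the default index, label and position coincide, so `s[5]` is unambiguous. Once the labels are integers in a different order, the two can disagree. The following is a minimal sketch, using a hypothetical Series named `prices` that is not part of this notebook, of the explicit accessors: `.loc` looks a value up by label and `.iloc` by position.

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions
prices = pd.Series([10, 15, 20], index=[2, 0, 1])

by_label = prices.loc[0]      # value stored under the label 0
by_position = prices.iloc[0]  # first value in the Series

print(by_label, by_position)
```

Using `.loc` or `.iloc` makes the intent explicit, which helps when the index is not the default 0, 1, 2, ... range.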


In [19]:
## Create your own labels
# Use the `index` argument to supply custom labels.

lt = [1, 2, 3, 4, 5]
labels = ['a','b','c','d','e']

s = pd.Series(lt, index=labels)
s

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [21]:
## When you have created labels, you can access an item by referring to the label.
print(s['d'])

4


In Pandas, you can create a Series from a key/value object, such as a dictionary, where:
- Keys become the index (labels).
- Values become the data (entries in the Series).

In [24]:
## Create a simple Pandas Series from a dictionary:

# Named `person` rather than `dict` so the built-in dict type is not shadowed
person = {
    'name' : 'shivam',
    'age' : 20,
    'college' : 'IIT Hyderabad'
}

ser = pd.Series(person)
print(ser)


name              shivam
age                   20
college    IIT Hyderabad
dtype: object


In [26]:
## To select only some of the items in the dictionary, use the `index` argument and specify
# only the items you want to include in the Series.

# Named `person` rather than `dict` so the built-in dict type is not shadowed
person = {
    'name' : 'shivam',
    'age' : 20,
    'college' : 'IIT Hyderabad'
}

ser = pd.Series(person, index=['name', 'college'])
print(ser)

name              shivam
college    IIT Hyderabad
dtype: object
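
One detail worth knowing about the `index` argument: a label that is not a key of the dictionary does not raise an error; it produces a missing value (`NaN`). A small sketch, where `'city'` is a hypothetical label chosen only for illustration:

```python
import pandas as pd

person = {'name': 'shivam', 'age': 20}

# 'city' is not a key of the dictionary, so its entry becomes NaN
ser = pd.Series(person, index=['name', 'city'])
print(ser)

# isna() marks which entries are missing
print(ser.isna())
```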


#### DataFrames
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [31]:
data = {
    'name' : ['shivam', 'devansh', 'vishal'],
    'age' : [20, 21, 20],
    'college' : ['IIT Hyderabad', 'IIT Roorkee', 'IIT Bombay']
}

# load data into a DataFrame object
df = pd.DataFrame(data)
df

Unnamed: 0,name,age,college
0,shivam,20,IIT Hyderabad
1,devansh,21,IIT Roorkee
2,vishal,20,IIT Bombay


- Pandas uses the `loc` attribute to return one or more specified rows.
- In DataFrames, `loc` is label-based: it selects rows by their index label rather than by position.

In [None]:
print(df.loc[2])  # a single label returns a Series

name           vishal
age                20
college    IIT Bombay
Name: 2, dtype: object


In [None]:
print(df.loc[[0, 2]])  # a list of labels returns a DataFrame

     name  age        college
0  shivam   20  IIT Hyderabad
2  vishal   20     IIT Bombay


`index` argument: lets you name your own row labels.

In [36]:
data = {
    'name' : ['shivam', 'devansh', 'vishal'],
    'age' : [20, 21, 20],
    'college' : ['IIT Hyderabad', 'IIT Roorkee', 'IIT Bombay']
}

idx = ['f1', 'f2', 'f3'] # f -> feature

# load data into a DataFrame object
df = pd.DataFrame(data, index=idx)
df

Unnamed: 0,name,age,college
f1,shivam,20,IIT Hyderabad
f2,devansh,21,IIT Roorkee
f3,vishal,20,IIT Bombay


In [None]:
df.loc['f2'] # 2nd row
# return a series

name           devansh
age                 21
college    IIT Roorkee
Name: f2, dtype: object

In [None]:
df.loc[['f1','f3']] # 1st & 3rd row
# return a dataframe

Unnamed: 0,name,age,college
f1,shivam,20,IIT Hyderabad
f3,vishal,20,IIT Bombay
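
With custom labels in place, it can help to see label-based and position-based selection side by side. The following is a minimal sketch on a reduced copy of the table above (the variable names `age_f2`, `first_row`, and `label_slice` are illustrative only). It uses `.iloc`, the position-based counterpart of `.loc`.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['shivam', 'devansh', 'vishal'], 'age': [20, 21, 20]},
    index=['f1', 'f2', 'f3'],
)

age_f2 = df.loc['f2', 'age']     # one cell: row label, then column label
first_row = df.iloc[0]           # first row by position, whatever its label
label_slice = df.loc['f1':'f2']  # label slices include both endpoints

print(age_f2)
print(first_row)
print(label_slice)
```

Note that a label slice with `.loc` includes the end label, unlike ordinary Python slicing by position.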


In [42]:
## load files into a dataframe
# If data sets are stored in a file, Pandas can load them into a DataFrame.

df = pd.read_csv('sales_data.csv')
df



Unnamed: 0,Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
0,10001,2024-01-01,Electronics,iPhone 14 Pro,2,999.99,1999.98,North America,Credit Card
1,10002,2024-01-02,Home Appliances,Dyson V11 Vacuum,1,499.99,499.99,Europe,PayPal
2,10003,2024-01-03,Clothing,Levi's 501 Jeans,3,69.99,209.97,Asia,Debit Card
3,10004,2024-01-04,Books,The Da Vinci Code,4,15.99,63.96,North America,Credit Card
4,10005,2024-01-05,Beauty Products,Neutrogena Skincare Set,1,89.99,89.99,Europe,PayPal
...,...,...,...,...,...,...,...,...,...
235,10236,2024-08-23,Home Appliances,Nespresso Vertuo Next Coffee and Espresso Maker,1,159.99,159.99,Europe,PayPal
236,10237,2024-08-24,Clothing,Nike Air Force 1 Sneakers,3,90.00,270.00,Asia,Debit Card
237,10238,2024-08-25,Books,The Handmaid's Tale by Margaret Atwood,3,10.99,32.97,North America,Credit Card
238,10239,2024-08-26,Beauty Products,Sunday Riley Luna Sleeping Night Oil,1,55.00,55.00,Europe,PayPal


Read CSV Files
- A simple way to store large data sets is to use CSV files (comma-separated values).
- CSV files contain plain text and are a well-known format that can be read by almost any tool, including Pandas.
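
To see that a CSV file is just plain text, here is a small sketch that parses CSV text held in a string instead of a file on disk. The three sample rows are copied from this notebook's output. Wrapping the string in `io.StringIO` is only a convenience for the demonstration, so the example runs without `data.csv`.

```python
import io
import pandas as pd

# First line is the header; the last row has an empty Calories field
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,90,112,
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.dtypes)  # Calories is float64 because the empty field becomes NaN
```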

In [None]:
## Load the CSV into a DataFrame

df = pd.read_csv('data.csv')
df

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
...,...,...,...,...
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4


In [None]:
## If you have a large DataFrame with many rows, Pandas by default displays only the first 5 rows and the last 5 rows

Tip: use the `to_string()` method to print the entire DataFrame.

In [44]:
print(df.to_string())

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

max_rows
- The number of rows displayed is controlled by Pandas' display options.
- You can check the current maximum number of rows with `pd.options.display.max_rows`.

In [None]:
print(pd.options.display.max_rows)

## If the DataFrame contains more than 60 rows, `print(df)` shows only
# the headers and the first and last 5 rows.

60


In [48]:
## Increase the maximum number of rows to display the entire DataFrame:

pd.options.display.max_rows = 200
print(pd.options.display.max_rows)

200
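
Assigning to `pd.options.display.max_rows` changes the setting for the rest of the session. If the larger limit is needed only once, one option is `pd.option_context`, which restores the previous value when the block ends. A minimal sketch, using a hypothetical 100-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'n': range(100)})  # hypothetical data, 100 rows

before = pd.get_option('display.max_rows')

with pd.option_context('display.max_rows', 200):
    inside = pd.get_option('display.max_rows')
    print(df)  # displayed with the raised limit

after = pd.get_option('display.max_rows')
print(before, inside, after)  # the setting is restored after the block
```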


In [49]:
df

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
5,60,102,127,300.0
6,60,110,136,374.0
7,45,104,134,253.3
8,30,109,133,195.1
9,60,98,124,269.0


JSON (JavaScript Object Notation)
- JSON is plain text, but has the format of an object.
- Big data sets are often stored or extracted as JSON.

In [50]:
df = pd.read_json('data.json')
print(df.to_string())

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.5
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

In [None]:
'''
Dictionary as JSON
JSON objects have a format very similar to Python dictionaries.
'''

# If your JSON data is not in a file but already in a Python dictionary, you can load it into a DataFrame directly:

In [52]:
data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}

df = pd.DataFrame(data)
df

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409
1,60,117,145,479
2,60,103,135,340
3,45,109,175,282
4,45,117,148,406
5,60,102,127,300
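
One subtlety when building a DataFrame from a JSON-style dictionary: the inner keys (`"0"`, `"1"`, ...) are strings, so the row labels are strings too, even though they print like numbers. A short sketch on a reduced, two-row version of the data (the `json.loads` step is added here only to show where such a dictionary might come from):

```python
import json
import pandas as pd

json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 109}}'
data = json.loads(json_text)  # JSON object keys are always strings

df = pd.DataFrame(data)
print(df.index.tolist())      # ['0', '1']

first = df.loc['0']           # rows are selected with the string label '0'

# Convert the labels to integers if numeric labels are preferred
df.index = df.index.astype(int)
print(df.index.tolist())
```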


Analyzing DataFrames: Viewing the Data
- The `head()` method gives a quick overview of the DataFrame.
- It returns the headers and a specified number of rows, starting from the top.

Note: if the number of rows is not specified, `head()` returns the top 5 rows.

In [53]:
df = pd.read_csv('data.csv')
df.head(10)

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
5,60,102,127,300.0
6,60,110,136,374.0
7,45,104,134,253.3
8,30,109,133,195.1
9,60,98,124,269.0


In [None]:
## Print the first 5 rows of the DataFrame
df.head(5) # or df.head() since default is 5

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0


- The `tail()` method shows the last rows of the DataFrame.
- It returns the headers and a specified number of rows, starting from the bottom.

In [56]:
df.tail(10)

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
159,30,80,120,240.9
160,30,85,120,250.4
161,45,90,130,260.4
162,45,95,130,270.0
163,45,100,140,280.9
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4
168,75,125,150,330.4


In [57]:
df.tail()

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
164,60,105,140,290.8
165,60,110,145,300.0
166,60,115,145,310.2
167,75,120,150,320.4
168,75,125,150,330.4
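
Both methods also accept a negative number, which inverts the selection: `head(-n)` returns everything except the last `n` rows, and `tail(-n)` returns everything except the first `n` rows. A small sketch on a hypothetical five-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Duration': [60, 60, 45, 45, 30]})

all_but_last_two = df.head(-2)   # drops the last 2 rows
all_but_first_two = df.tail(-2)  # drops the first 2 rows

print(all_but_last_two)
print(all_but_first_two)
```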


A DataFrame object has a method called `info()` that gives you more information about the data set.

In [59]:
print(df.info())  # info() prints its summary itself and returns None, hence the trailing "None" below

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None


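Empty values, or null values, can distort an analysis, so you should consider removing rows that contain them. In the `info()` output above, `Calories` has 164 non-null values out of 169 entries, so 5 rows are missing a value. Handling such rows is a step towards what is called cleaning data.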
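
As a first look at missing values, the sketch below counts them per column with `isna().sum()` and drops the affected rows with `dropna()`. The three-row DataFrame is a hypothetical stand-in for `data.csv`, so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv with one missing Calories value
df = pd.DataFrame({
    'Duration': [60, 45, 60],
    'Calories': [409.1, np.nan, 340.0],
})

missing_per_column = df.isna().sum()  # number of missing values in each column
cleaned = df.dropna()                 # new DataFrame without rows containing NaN

print(missing_per_column)
print(cleaned)
```

`dropna()` returns a new DataFrame and leaves `df` unchanged, so the original data stays available for comparison.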

---
Additional Notes: