# Pandas tutorials

## What is Pandas?

### Pandas is one of the most widely used Python libraries for data analysis. Its two main data structures are Series and DataFrame. A Series is one-dimensional: it holds a single column of values. A DataFrame is two-dimensional: it consists of rows and columns.

### To install Pandas, you can use "pip install pandas"

## Jupyter for beginners
## Link: https://daily.dev/blog/jupyter-for-beginners

# Pandas First Steps
## Install and import
## Pandas is an easy package to install. Open up your terminal program (for Mac users) or command line (for PC users) and install it using either of the following commands:

## conda install pandas

## OR

## pip install pandas

## Alternatively, if you're currently viewing this article in a Jupyter notebook you can run this cell:

## !pip install pandas

## Now to the basic components of pandas

# Core components of pandas: Series and DataFrames
## The primary two components of pandas are the Series and DataFrame.

## A Series is essentially a column, and a DataFrame is a two-dimensional table made up of a collection of Series.

![image.png](attachment:761cf7cc-121b-4c99-ad31-78877a8fb035.png)![image.png](attachment:29e13c95-c0a5-4bca-b1e5-6e8f4340f9c1.png)
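The difference in dimensionality can be checked directly. The following is a minimal sketch; the variable names and sample values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# A Series: one column of values with an index
cars = pd.Series(["BMW", "Volvo", "Ford"], name="cars")

# A DataFrame: several columns that share the same index
df = pd.DataFrame({"cars": ["BMW", "Volvo", "Ford"], "passings": [3, 7, 2]})

print(cars.ndim, cars.shape)  # number of dimensions and shape of the Series
print(df.ndim, df.shape)      # number of dimensions and shape of the DataFrame

# Selecting a single column of a DataFrame gives back a Series
column = df["passings"]
print(type(column).__name__)
```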


# Installation of Pandas

In [1]:
!pip install pandas




[notice] A new release of pip is available: 23.3.1 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


# Upgrade the pip package manager

In [2]:
pip install --upgrade pip

Collecting pip
  Downloading pip-24.2-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-24.2-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.3.1
    Uninstalling pip-23.3.1:
      Successfully uninstalled pip-23.3.1
Successfully installed pip-24.2
Note: you may need to restart the kernel to use updated packages.


# Import Pandas:
## Once Pandas is installed, import it in your applications with the import keyword:

In [1]:
import pandas

## Example

In [3]:
mydataset = {
             'cars': ["BMW", "Volvo", "Ford"],
            'passings': [3,7,2]
            }
myvar = pandas.DataFrame(mydataset)
print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


## Pandas as pd:
## Pandas is usually imported under the pd alias.

## alias: In Python, an alias is an alternate name that refers to the same thing.

## Create an alias with the as keyword while importing:

In [5]:
import pandas as pd

## Now the Pandas package can be referred to as pd instead of pandas.

## Example

In [6]:
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


# Checking Pandas Version:
## The version string is stored in the `__version__` attribute.

In [7]:
#Example
import pandas as pd

print(pd.__version__)

2.1.1


# _What is a Series?_
## - A Pandas Series is like a column in a table.

## - It is a one-dimensional array holding data of any type.

### Example-
#### Create a simple Pandas Series from a list:

In [10]:

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64


# Labels:
## If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second has index 1, and so on.

## This label can be used to access a specified value.

In [11]:
# Example
# Return the first value of the Series:

print(myvar[0])

1


# Create Labels:
## With the index argument, you can name your own labels.

In [12]:
# Example
# Create your own labels:

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["A", "B", "C"])

print(myvar)

A    1
B    7
C    2
dtype: int64


### When you have created labels, you can access an item by referring to the label.

### Example:
### Return the value of "B":

In [15]:
print(myvar["B"])

7
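Besides plain square brackets, a Series can be indexed explicitly by label with loc or by integer position with iloc. This is a small sketch reusing the labels from the example above; the variable names are assumptions made for illustration.

```python
import pandas as pd

s = pd.Series([1, 7, 2], index=["A", "B", "C"])

by_label = s.loc["B"]        # look up by label
by_position = s.iloc[1]      # look up by integer position (0-based)
several = s.loc[["A", "C"]]  # a list of labels returns a Series

print(by_label, by_position)
print(several.tolist())
```

Using loc and iloc makes it explicit whether a label or a position is meant, which matters once the labels are themselves integers.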


# Key/Value Objects as Series:
## You can also use a key/value object, like a dictionary, when creating a Series.

In [16]:
# Example-
# Create a simple Pandas Series from a dictionary:

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

day1    420
day2    380
day3    390
dtype: int64


### Note: The keys of the dictionary become the labels.

## To select only some of the items in the dictionary, use the index argument and specify only the items you want to include in the Series.

In [17]:
#Example-
# Create a Series using only data from "day1" and "day2":

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)

day1    420
day2    380
dtype: int64
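If the index argument names a key that is not in the dictionary, Pandas does not raise an error; it fills that entry with NaN. The sketch below assumes a hypothetical label "day4" that the dictionary does not contain.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is not a key in the dictionary, so its value is missing (NaN)
s = pd.Series(calories, index=["day1", "day4"])

print(s)
print(s.isna())
```

Because NaN is a floating-point value, the resulting Series is stored as float64 rather than int64.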


# DataFrames:
## What is a DataFrame?
### - A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.
### - Data sets in Pandas are usually held in these two-dimensional tables.

### - A Series is like a single column; a DataFrame is the whole table.

### Example-
### Create a DataFrame from a dictionary of two lists:

In [20]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# load data into a DataFrame object:
myvar = pd.DataFrame(data)

print(myvar)

   calories  duration
0       420        50
1       380        40
2       390        45
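A DataFrame can also be assembled from actual Series objects, with each Series becoming one column. This is a minimal sketch using the same values as above; the Series names are assumptions for illustration.

```python
import pandas as pd

calories = pd.Series([420, 380, 390])
duration = pd.Series([50, 40, 45])

# Each dictionary key becomes a column name; rows are aligned on the Series index
df = pd.DataFrame({"calories": calories, "duration": duration})

print(df)
```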


## Locate Row:
### - As you can see from the result above, the DataFrame is like a table with rows and columns.

### - Pandas uses the _loc_ attribute to return one or more specified row(s)

### Example-
### Return row 0:

In [22]:
#refer to the row index:
print(myvar.loc[0])

calories    420
duration     50
Name: 0, dtype: int64


### Note: This example returns a Pandas Series.

### Example-
### Return row 0 and 1:

In [25]:
#use a list of indexes:
print(myvar.loc[[0, 1]])

   calories  duration
0       420        50
1       380        40


## Note: When you pass a list of labels inside [], the result is a Pandas DataFrame.
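The two return types can be confirmed with a quick check. This sketch rebuilds the same small DataFrame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

row = df.loc[0]          # single label -> Series
rows = df.loc[[0, 1]]    # list of labels -> DataFrame
one_row = df.loc[[0]]    # a one-element list still gives a DataFrame

print(type(row).__name__)
print(type(rows).__name__)
print(type(one_row).__name__, one_row.shape)
```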

## Named Indexes:
### With the index argument, you can name your own indexes.

In [27]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df) 

      calories  duration
day1       420        50
day2       380        40
day3       390        45


## Locate Named Indexes:
### Use the named index in the loc attribute to return the specified row(s).
#### Example-
#### Return "day2":

In [28]:
#refer to the named index:
print(df.loc["day2"])

calories    380
duration     40
Name: day2, dtype: int64
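With named indexes, loc also accepts a list of labels, a label slice, or a row and column pair, while iloc selects by position. The sketch below is illustrative and rebuilds the DataFrame from the example above.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

two_days = df.loc[["day1", "day3"]]   # list of labels -> DataFrame
first_two = df.loc["day1":"day2"]     # label slices include both endpoints
cell = df.loc["day2", "calories"]     # row label and column label -> single value
second_row = df.iloc[1]               # position-based: second row, whatever its label

print(two_days)
print(first_two)
print(cell)
print(second_row.name)
```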


# Pandas Read CSV
## Read CSV Files:
### - A simple way to store big data sets is to use CSV files (comma-separated values).
### - CSV files contain plain text in a well-known format that can be read by almost any tool, including Pandas.
### - In our examples we will be using a CSV file called 'data.csv'.
### Link: https://drive.google.com/file/d/1oeJY42VtjD91oBQ8GmK50UZbKap8q2DK/view?usp=sharing
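If data.csv is not at hand, read_csv can be tried on in-memory text instead of a file. The sketch below uses three rows taken from the data shown in this section; wrapping the text in StringIO is an assumption made so the example is self-contained.

```python
from io import StringIO

import pandas as pd

# Three sample rows; the empty last field on the final row is a missing value
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,90,112,
"""

df = pd.read_csv(StringIO(csv_text))

print(df)
print(df.dtypes)
```

The first line of the text is used as the column headers, and the empty Calories field is read as NaN.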

In [1]:
# Example
# Load the CSV into a DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.to_string()) 

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

### Tip: use to_string() to print the entire DataFrame.
#### If a DataFrame has more rows than the display limit, Pandas prints only the first 5 rows and the last 5 rows:
### Example
#### Print the DataFrame without the to_string() method:

In [2]:
import pandas as pd

df = pd.read_csv('data.csv')

print(df) 

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]
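To look at just the beginning or end of a large DataFrame on purpose, head() and tail() can be used. This sketch builds a hypothetical 100-row DataFrame rather than reading data.csv.

```python
import pandas as pd

# Hypothetical DataFrame with 100 rows, used only for illustration
df = pd.DataFrame({"n": range(100)})

first_rows = df.head()    # first 5 rows by default
last_rows = df.tail(3)    # last 3 rows

print(first_rows)
print(last_rows)
```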


# max_rows:
### The number of rows returned is defined in Pandas option settings.

### You can check the current limit by reading the pd.options.display.max_rows option.
### Example-
### Check the number of maximum returned rows:

In [4]:
import pandas as pd

print(pd.options.display.max_rows) 

60


### On my system the number is 60, which means that if the DataFrame contains more than 60 rows, the print(df) statement shows only the headers and the first and last 5 rows.

### You can change the limit by assigning a new value to the same option.
### Example-
### Increase the maximum number of rows to display the entire DataFrame:

In [8]:
import pandas as pd

pd.options.display.max_rows = 9999

df = pd.read_csv('data.csv')

print(df) 

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   
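Assigning to pd.options.display.max_rows changes the setting for the rest of the session. If the change is only needed for one print, pd.option_context can apply it temporarily. This is a sketch with a hypothetical 100-row DataFrame.

```python
import pandas as pd

# Hypothetical DataFrame with 100 rows, used only for illustration
df = pd.DataFrame({"n": range(100)})

before = pd.get_option("display.max_rows")

# Inside the with-block, max_rows is None (no limit); it is restored afterwards
with pd.option_context("display.max_rows", None):
    full_text = repr(df)

after = pd.get_option("display.max_rows")

print(len(full_text.splitlines()))  # header line plus one line per row
print(before == after)
```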

# Pandas Read JSON
## Read JSON
### Large data sets are often stored or exchanged as JSON.

### JSON is plain text structured as objects and arrays. It is widely used in programming, and Pandas can read it directly.

### In our examples we will be using a JSON file called 'sample.json'.
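Reading JSON works much like reading CSV. The sketch below is an assumption-laden illustration: the layout of 'sample.json' is not shown here, so it uses a small hypothetical JSON array of records, held in memory with StringIO, in place of the file.

```python
from io import StringIO

import pandas as pd

# Hypothetical JSON content: a list of records, one object per row
json_text = """
[
  {"Duration": 60, "Pulse": 110, "Calories": 409.1},
  {"Duration": 60, "Pulse": 117, "Calories": 479.0},
  {"Duration": 45, "Pulse": 90, "Calories": null}
]
"""

df = pd.read_json(StringIO(json_text), orient="records")

print(df)
```

For a file on disk, the path is passed in place of the StringIO object, for example pd.read_json('sample.json'), provided the file's layout matches the orient that read_json expects.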