# <div style="padding:20px;color:white;margin:0;font-size:35px;font-family:Georgia;text-align:left;display:fill;border-radius:5px;background-color:#254E58;overflow:hidden"><b>Day 48 - PANDAS </b></div>

## **What is Pandas?**
* Pandas is a Python library used for working with data sets.

* It has functions for analyzing, cleaning, exploring, and manipulating data.

* The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

### **Why Use Pandas?**
* Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

* Pandas can clean messy data sets and make them readable and relevant.

* Relevant data is very important in data science.

### **What Can Pandas Do?**
* Pandas gives you answers to questions about the data, such as:
    * Is there a correlation between two or more columns?
    * What is the average value?
    * What is the maximum value?
    * What is the minimum value?
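
The questions above map onto a few built-in methods. The following is a minimal sketch, assuming a small made-up DataFrame (the column names and values are invented for illustration and are not part of the data used later in this notebook):

```python
import pandas as pd

# Hypothetical workout data, invented for this illustration.
df = pd.DataFrame({
    "duration": [30, 45, 60, 60],
    "calories": [200, 300, 400, 420],
})

# Pearson correlation between two columns (a value between -1 and 1).
corr = df["duration"].corr(df["calories"])

# Average, maximum, and minimum of a single column.
avg_calories = df["calories"].mean()
max_calories = df["calories"].max()
min_calories = df["calories"].min()

print(corr)
print(avg_calories, max_calories, min_calories)
```

A correlation close to 1 here only reflects the invented numbers; real data will usually be noisier.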


**Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.**
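
As a rough illustration of cleaning, the sketch below removes rows with empty values using `dropna()`. The DataFrame is a made-up example, and dropping rows is only one possible cleaning strategy:

```python
import pandas as pd

# Hypothetical data with one missing value (None becomes NaN).
df = pd.DataFrame({
    "duration": [60, 45, 30],
    "calories": [409.1, None, 195.1],
})

# dropna() returns a new DataFrame without the rows that contain NaN;
# the original DataFrame is left unchanged.
cleaned = df.dropna()

print(cleaned)
```

The row labels of the remaining rows are kept, so the index of `cleaned` has a gap where the row was removed.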

### **Installation of Pandas**

**If you have Python and PIP already installed on a system, then installation of Pandas is very easy.**

**Install it using this command:**

C:\Users\Your Name>pip install pandas

### Import Pandas
**Once Pandas is installed, import it in your applications by adding the import keyword:**

In [1]:
import pandas

In [2]:
import pandas

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pandas.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


### Pandas as pd
**Pandas is usually imported under the pd alias.**

_Create an alias with the **as** keyword while importing:_

In [3]:
import pandas as pd

In [4]:
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


## Checking Pandas Version

In [5]:
import pandas as pd

print(pd.__version__)

2.1.4


### Pandas Series

**What is a Series?**
* A Pandas Series is like a column in a table.

* It is a one-dimensional array holding data of any type.

**Create a simple Pandas Series from a list:**

In [6]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64


### Labels
* If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second value has index 1, and so on.

* This label can be used to access a specified value.

**Return the first value of the Series:**

In [8]:
print(myvar[0])

1


### Create Labels
* With the index argument, you can name your own labels.

**Create your own labels:**

In [9]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

x    1
y    7
z    2
dtype: int64


**Return the value of "y":**

In [10]:
print(myvar["y"])

7
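
Once a Series has custom labels, it can help to be explicit about whether a value is looked up by label or by position. A small sketch using the same values, with `loc` for labels and `iloc` for positions:

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

# Look up by label.
by_label = myvar.loc["y"]

# Look up by position (0-based), regardless of the labels.
by_position = myvar.iloc[1]

print(by_label, by_position)
```

Both lookups return the same element here, because "y" is the label of the second value.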


### Key/Value Objects as Series
* You can also use a key/value object, like a dictionary, when creating a Series.

**Create a simple Pandas Series from a dictionary:**

In [11]:
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

day1    420
day2    380
day3    390
dtype: int64


**Create a Series using only data from "day1" and "day2":**

In [12]:
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)

day1    420
day2    380
dtype: int64
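
The `index` argument selects keys from the dictionary, so it is worth knowing what happens when a label is not a key. In the sketch below, "day4" is a hypothetical label added only to show this case:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is not a key in the dictionary, so its value is missing (NaN).
myvar = pd.Series(calories, index=["day1", "day4"])

print(myvar)
```

Because NaN is a floating-point value, the Series is stored as `float64` instead of `int64`.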


### DataFrames
* Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

* A Series is like a column; a DataFrame is the whole table.

**Create a DataFrame from two columns of data:**

In [13]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

myvar = pd.DataFrame(data)

print(myvar)

   calories  duration
0       420        50
1       380        40
2       390        45
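
The cell above passes plain lists. The same table can also be built from actual Series objects, as in this sketch (the variable names `calories` and `duration` are chosen here for illustration):

```python
import pandas as pd

# Each Series becomes one column of the DataFrame.
calories = pd.Series([420, 380, 390])
duration = pd.Series([50, 40, 45])

df = pd.DataFrame({"calories": calories, "duration": duration})

print(df)
```

The Series are aligned on their index labels, so both should share the same index for the rows to line up.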


## Pandas DataFrames


**What is a DataFrame?**
* A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

**Create a simple Pandas DataFrame:**

In [14]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

print(df)

   calories  duration
0       420        50
1       380        40
2       390        45


### Locate Row
* As you can see from the result above, the DataFrame is like a table with rows and columns.

* Pandas uses the loc attribute to return one or more specified rows.

**Return row 0:**

In [15]:
#refer to the row index:
print(df.loc[0])

calories    420
duration     50
Name: 0, dtype: int64


**Return row 0 and 1:**

In [16]:
#use a list of indexes:
print(df.loc[[0, 1]])

   calories  duration
0       420        50
1       380        40
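
The two results above have different types: a single label gives a Series, while a list of labels gives a DataFrame, even when the list has only one entry. A short sketch to check this:

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

single = df.loc[0]     # one row, returned as a Series
wrapped = df.loc[[0]]  # one row, returned as a DataFrame

print(type(single).__name__)
print(type(wrapped).__name__)
```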


### Named Indexes
* With the index argument, you can name your own indexes.

**Add a list of names to give each row a name:**

In [17]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df) 

      calories  duration
day1       420        50
day2       380        40
day3       390        45


### Locate Named Indexes
* Use the named index in the loc attribute to return the specified row(s).

**Return "day2":**

In [18]:
#refer to the named index:
print(df.loc["day2"])

calories    380
duration     40
Name: day2, dtype: int64
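
Named indexes also work with lists and slices in loc. The sketch below reuses the same data; note that a label slice includes both end points:

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# A list of labels returns exactly those rows.
subset = df.loc[["day1", "day3"]]

# A label slice returns every row from "day1" up to and including "day2".
span = df.loc["day1":"day2"]

print(subset)
print(span)
```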


### Load Files Into a DataFrame
* If your data sets are stored in a file, Pandas can load them into a DataFrame.

**Load a comma separated file (CSV file) into a DataFrame:**

In [23]:
import pandas as pd

df = pd.read_csv('data.csv')

print(df) 

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]


## Pandas Read CSV

**Read CSV Files**
* A simple way to store big data sets is to use CSV files (comma-separated values files).

* CSV files contain plain text and are a well-known format that can be read by almost any tool, including Pandas.

* In our examples we will be using a CSV file called 'data.csv'.

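
If the file is not at hand, the parsing step can still be tried on CSV text held in memory. This is a minimal sketch, assuming a three-line excerpt that mirrors the first rows of the data; `io.StringIO` stands in for the file:

```python
import io

import pandas as pd

# A short CSV excerpt: a header line followed by comma-separated rows.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
60,103,135,340.0
"""

# read_csv() accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```

The first line of the text becomes the column names, and each following line becomes one row.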

**Load the CSV into a DataFrame:**

In [24]:
import pandas as pd

df = pd.read_csv('data.csv')

print(df.to_string()) 

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

**Print the DataFrame without the to_string() method:**

In [25]:
import pandas as pd

df = pd.read_csv('data.csv')

print(df) 

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]


**Increase the maximum number of rows to display the entire DataFrame:**

In [28]:
import pandas as pd

pd.options.display.max_rows = 9999

df = pd.read_csv('data.csv')

print(df) 

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   
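
Assigning to pd.options.display.max_rows changes the limit for the rest of the session. A scoped alternative, sketched here on a made-up 100-row DataFrame, is `pd.option_context()`, which restores the previous setting when the `with` block ends:

```python
import pandas as pd

# Hypothetical DataFrame, large enough to be shortened by the default limit.
df = pd.DataFrame({"n": range(100)})

before = pd.get_option("display.max_rows")

# Inside the block there is no row limit (None means "show all rows").
with pd.option_context("display.max_rows", None):
    full_text = repr(df)

# Outside the block the earlier setting is back in place.
after = pd.get_option("display.max_rows")

print(len(full_text.splitlines()))
print(before == after)
```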

## Pandas Read JSON

**Read JSON**
* Big data sets are often stored or extracted as JSON.

* JSON is plain text in the format of an object, and it is widely supported in the world of programming, including by Pandas.

* In our examples we will be using a JSON file called 'data.json'.


**Load the JSON file into a DataFrame:**

In [None]:
import pandas as pd

df = pd.read_json('data.json')

print(df.to_string()) 
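
To try JSON loading without a file on disk, the text can be wrapped in `io.StringIO`. This is a minimal sketch, assuming JSON shaped like the dictionary used in this section, shortened to two columns and three rows:

```python
import io

import pandas as pd

# JSON text where each top-level key is a column and each inner key is a row label.
json_text = (
    '{"Duration": {"0": 60, "1": 60, "2": 60},'
    ' "Pulse": {"0": 110, "1": 117, "2": 103}}'
)

# read_json() accepts a path or a file-like object.
df = pd.read_json(io.StringIO(json_text))

print(df.to_string())
```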

**Load a Python Dictionary into a DataFrame:**



In [None]:
import pandas as pd

data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}

df = pd.DataFrame(data)

print(df) 
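
When a DataFrame is built from nested dictionaries like this, the inner keys become the row labels. Because those keys are strings here, rows are looked up with "0" and not with the number 0. A shortened sketch of the same structure:

```python
import pandas as pd

data = {
    "Duration": {"0": 60, "1": 60},
    "Pulse": {"0": 110, "1": 117},
}
df = pd.DataFrame(data)

# The row labels are the strings "0" and "1", not integers.
labels = df.index.tolist()

# Use the string label with loc to fetch a row.
first_row = df.loc["0"]

print(labels)
print(first_row)
```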


## Pandas - Analyzing DataFrames

**Viewing the Data**
* One of the most used methods for getting a quick overview of a DataFrame is the head() method.

* The head() method returns the headers and a specified number of rows, starting from the top.

**Get a quick overview by printing the first 10 rows of the DataFrame:**

In [None]:
import pandas as pd

df = pd.read_csv('data.csv')

print(df.head(10))

**Print the first 5 rows of the DataFrame:**

In [None]:
import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())
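
The behavior of head() can be checked without data.csv. The sketch below uses a made-up 12-row DataFrame to show that head() returns 5 rows by default and that head(n) returns the first n rows:

```python
import pandas as pd

# Hypothetical DataFrame with 12 rows: 10, 20, ..., 120.
df = pd.DataFrame({"Duration": list(range(10, 130, 10))})

default_head = df.head()  # no argument: the first 5 rows
three = df.head(3)        # the first 3 rows

print(len(default_head))
print(three)
```

If the DataFrame has fewer rows than requested, head() simply returns all of them.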