# PANDAS

## Pandas Introduction
What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.



## Why Use Pandas?
Pandas allows us to analyze big data and draw conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

Data science is a branch of computer science that studies how to store, use, and analyze data in order to derive information from it.



## What Can Pandas Do?
Pandas gives you answers about the data, such as:

- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?

Pandas is also able to delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
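
The questions above can be sketched in a few lines. This is only an illustration: the column names and values are made up for the example and are not taken from any data set used in this notebook.

```python
import pandas as pd

# Hypothetical sample data; None marks a missing value.
df = pd.DataFrame({
    "duration": [60, 45, 30, None],
    "calories": [400, 300, 200, 250],
})

# Correlation between two columns (pairs with a missing value are ignored).
correlation = df["duration"].corr(df["calories"])

# Average, maximum, and minimum of one column.
average = df["calories"].mean()
highest = df["calories"].max()
lowest = df["calories"].min()

# Cleaning: drop every row that contains an empty value.
cleaned = df.dropna()

print(correlation, average, highest, lowest)
print(cleaned)
```

Here dropna() returns a new DataFrame and leaves the original unchanged.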



## Where is the Pandas Codebase?
The source code for Pandas is located in this GitHub repository: https://github.com/pandas-dev/pandas

GitHub enables many people to work on the same codebase.

## IMPORTING PANDAS

In [1]:
import pandas

In [2]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


## Pandas as pd
Pandas is usually imported under the pd alias.

Alias: in Python, an alias is an alternate name for referring to the same thing.

Create an alias with the as keyword while importing:



In [3]:
import pandas as pd

In [4]:
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)


    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


### Checking Pandas Version
The version string is stored under the `__version__` attribute.

In [5]:
print(pd.__version__)

1.4.4


## Pandas Series
What is a Series?
A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

Create a simple Pandas Series from a list:



In [6]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)


0    1
1    7
2    2
dtype: int64


## Labels
If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.
This label can be used to access a specified value.
Return the first value of the Series:




In [7]:
print(myvar[0])

1


## Create Labels
With the index argument, you can name your own labels.

Create your own labels:



In [9]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)
# When you have created labels, you can access an item by referring to the label.


# Return the value of "y":

print(myvar["y"])



x    1
y    7
z    2
dtype: int64
7
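
Once a Series has custom labels, it can help to state explicitly whether you mean a label or a position. A minimal sketch, assuming the same three values and labels as above, using the loc (label-based) and iloc (position-based) accessors:

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

# Select by label.
by_label = myvar.loc["y"]

# Select by integer position (counting from 0).
by_position = myvar.iloc[1]

print(by_label, by_position)
```

Both expressions return the same item here, because "y" is the label of the second value.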



## Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.

Create a simple Pandas Series from a dictionary:




In [10]:
calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

day1    420
day2    380
day3    390
dtype: int64


Note: The keys of the dictionary become the labels.

To select only some of the items in the dictionary, use the index argument and 
specify only the items you want to include in the Series.




In [11]:
# Create a Series using only data from "day1" and "day2":

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)


day1    420
day2    380
dtype: int64
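
A related case worth knowing: if the index argument names a label that is not a key in the dictionary, Pandas keeps the label and fills its value with NaN. A minimal sketch, where "day4" is a hypothetical label chosen only for illustration:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is not a key in the dictionary, so its value becomes NaN.
myvar = pd.Series(calories, index=["day1", "day4"])

print(myvar)
```

Because NaN is a floating-point value, the Series is stored as floats rather than integers in this case.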


## DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

Create a DataFrame from a dictionary of two lists, where each list becomes a column:


In [5]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

myvar = pd.DataFrame(data)

print(myvar)


   calories  duration
0       420        50
1       380        40
2       390        45
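
For comparison, here is a minimal sketch that builds the same table from two explicit Series objects. The variable names are illustrative only.

```python
import pandas as pd

# Each Series becomes one column of the DataFrame.
calories = pd.Series([420, 380, 390])
duration = pd.Series([50, 40, 45])

df_from_series = pd.DataFrame({"calories": calories, "duration": duration})

print(df_from_series)
```

The Series are aligned on their index labels, so rows with the same label end up in the same row of the table.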


## Pandas DataFrames
### What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a table with rows and columns.


In [6]:
import pandas as pd
#load data into a DataFrame object:
df = pd.DataFrame(myvar)
print(df) 

   calories  duration
0       420        50
1       380        40
2       390        45


### Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas uses the loc attribute to return one or more specified row(s).
Return row 0, then rows 0 and 1:

In [None]:
# Return row 0:
print(df.loc[0])

# Return rows 0 and 1 (use a list of indexes):
x = df.loc[[0, 1]]
print(x)
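
The kind of object that loc returns depends on how you ask. A minimal sketch, assuming the same calories/duration table as above: a single label gives a Series, while a list of labels gives a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

# A single label returns one row as a Series.
single_row = df.loc[0]

# A list of labels returns a DataFrame, even if the list has one item.
several_rows = df.loc[[0, 1]]

print(type(single_row).__name__)
print(type(several_rows).__name__)
```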

## Named Indexes
With the index argument, you can name your own indexes.
Add a list of names to give each row a name:

In [7]:

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df) 

      calories  duration
day1       420        50
day2       380        40
day3       390        45


Use the named index in the loc attribute to return the specified row(s).
Return "day2":

In [35]:
# Refer to the named index:
print(df.loc["day2"])

calories    380
duration     40
Name: day2, dtype: int64


## Load Files Into a DataFrame
If your data sets are stored in a file, Pandas can load them into a DataFrame.

In [51]:
# Build the path to a comma-separated (CSV) file inside the DATA folder:
import os

data_path = ['DATA']
filepath = os.sep.join(data_path + ['Salary Data.csv'])
print(filepath)

DATA\Salary Data.csv


## Pandas Read CSV
### Read CSV Files
A simple way to store big data sets is to use CSV files (comma-separated values files).
CSV files contain plain text in a well-known format that can be read by almost any tool, including Pandas.
The examples below use the CSV file 'Salary Data.csv', whose path is stored in the filepath variable.
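
If you do not have a CSV file at hand, you can still try read_csv(). The sketch below is an assumption-laden stand-in: it reads a tiny, made-up CSV text from memory through io.StringIO instead of a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Age,Gender,Salary
32,Male,90000
28,Female,65000
"""

# read_csv() accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```

The first line of the text is used as the column headers, just as with a real file.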


In [52]:
# Load the CSV into a DataFrame:
df = pd.read_csv(filepath)
# Tip: use to_string() to print the entire DataFrame.
print(df.to_string())
# Without to_string(), Pandas prints a large DataFrame in truncated form:
# only the first 5 rows and the last 5 rows are shown.


     Age  Gender Education Level                              Job Title  Years of Experience  Salary
0     32    Male      Bachelor's                      Software Engineer                  5.0   90000
1     28  Female        Master's                           Data Analyst                  3.0   65000
2     45    Male             PhD                         Senior Manager                 15.0  150000
3     36  Female      Bachelor's                        Sales Associate                  7.0   60000
4     52    Male        Master's                               Director                 20.0  200000
5     29    Male      Bachelor's                      Marketing Analyst                  2.0   55000
6     42  Female        Master's                        Product Manager                 12.0  120000
7     31    Male      Bachelor's                          Sales Manager                  4.0   80000
8     26  Female      Bachelor's                  Marketing Coordinator                  1.

In [55]:
# Print the DataFrame without the to_string() method:
df = pd.read_csv(filepath)
print(df) 


     Age  Gender Education Level                      Job Title  \
0     32    Male      Bachelor's              Software Engineer   
1     28  Female        Master's                   Data Analyst   
2     45    Male             PhD                 Senior Manager   
3     36  Female      Bachelor's                Sales Associate   
4     52    Male        Master's                       Director   
..   ...     ...             ...                            ...   
368   35  Female      Bachelor's       Senior Marketing Analyst   
369   43    Male        Master's         Director of Operations   
370   29  Female      Bachelor's         Junior Project Manager   
371   34    Male      Bachelor's  Senior Operations Coordinator   
372   44  Female             PhD        Senior Business Analyst   

     Years of Experience  Salary  
0                    5.0   90000  
1                    3.0   65000  
2                   15.0  150000  
3                    7.0   60000  
4                   

## max_rows
The number of rows returned is defined in Pandas option settings.
You can check your system's maximum rows with the pd.options.display.max_rows statement.

In [56]:
# Check the number of maximum returned rows:
print(pd.options.display.max_rows) 

60


On my system the number is 60, which means that if the DataFrame contains more than 60 rows,
the print(df) statement will display only the headers and the first and last 5 rows.

You can change the maximum number of rows by assigning a new value to the same option.
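
One way to try this without permanently changing your settings is pd.option_context(), which applies an option only inside a with block. A minimal sketch, where the 100-row DataFrame and the value 10 are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical DataFrame with 100 rows.
df = pd.DataFrame({"n": range(100)})

default_max_rows = pd.get_option("display.max_rows")

# The option is changed only inside the with block.
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")
    print(df)  # printed in truncated form, because 100 rows exceed the limit

# After the block, the previous value is restored.
restored = pd.get_option("display.max_rows")
print(inside, restored)
```

To change the setting for the rest of the session instead, assign to it directly, for example pd.options.display.max_rows = 100.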

## Pandas Read JSON
### Read JSON
Big data sets are often stored or extracted as JSON.
JSON is plain text, but has the format of an object, and is well known in the world of programming, including Pandas.

In our examples we will be using a JSON file called 'data.json'.

Load the JSON file into a DataFrame:


In [57]:
df = pd.read_json('data.json')

print(df.to_string()) 
# Tip: use to_string() to print the entire DataFrame.
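
If 'data.json' is not available, read_json() can be tried on JSON text held in memory. This is a sketch under assumptions: the JSON content below is a shortened, made-up sample, and it is wrapped in io.StringIO so that it is treated as a file-like object.

```python
import io

import pandas as pd

# Hypothetical JSON text: each top-level key is a column,
# and each inner key is a row label.
json_text = (
    '{"Duration": {"0": 60, "1": 60, "2": 45},'
    ' "Pulse": {"0": 110, "1": 117, "2": 109}}'
)

df = pd.read_json(io.StringIO(json_text))

print(df)
```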


## Dictionary as JSON
JSON objects have nearly the same format as Python dictionaries.

If your JSON data is not in a file but already in a Python dictionary, you can load it into a DataFrame directly:


In [58]:
# Load a Python Dictionary into a DataFrame:
data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}
df = pd.DataFrame(data)
print(df) 


   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300
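
One detail is easy to miss in the output: the row labels come from the inner dictionary keys, so here they are the strings "0", "1", and so on, not integers. A minimal sketch with a shortened, made-up version of the dictionary shows how to check this and convert the labels if you want integer positions:

```python
import pandas as pd

# Shortened, hypothetical version of the dictionary above.
data = {
    "Duration": {"0": 60, "1": 45},
    "Pulse": {"0": 110, "1": 109},
}

df = pd.DataFrame(data)

# The row labels are strings, taken from the inner keys.
labels_before = list(df.index)

# Convert the labels to integers.
df.index = df.index.astype(int)
labels_after = list(df.index)

print(labels_before, labels_after)
```

With string labels, df.loc["0"] selects the first row, while df.loc[0] would raise a KeyError until the labels are converted.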


## Pandas - Analyzing DataFrames
### Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is the head() method.
The head() method returns the headers and a specified number of rows, starting from the top.
Get a quick overview by printing the first 10 rows of the DataFrame:

In [59]:
df = pd.read_csv(filepath)

print(df.head(10))

   Age  Gender Education Level              Job Title  Years of Experience  \
0   32    Male      Bachelor's      Software Engineer                  5.0   
1   28  Female        Master's           Data Analyst                  3.0   
2   45    Male             PhD         Senior Manager                 15.0   
3   36  Female      Bachelor's        Sales Associate                  7.0   
4   52    Male        Master's               Director                 20.0   
5   29    Male      Bachelor's      Marketing Analyst                  2.0   
6   42  Female        Master's        Product Manager                 12.0   
7   31    Male      Bachelor's          Sales Manager                  4.0   
8   26  Female      Bachelor's  Marketing Coordinator                  1.0   
9   38    Male             PhD       Senior Scientist                 10.0   

   Salary  
0   90000  
1   65000  
2  150000  
3   60000  
4  200000  
5   55000  
6  120000  
7   80000  
8   45000  
9  110000  



Note: if the number of rows is not specified, the head() method will return the top 5 rows.

In [60]:
# Print the first 5 rows of the DataFrame:

import pandas as pd

df = pd.read_csv(filepath)

print(df.head())

   Age  Gender Education Level          Job Title  Years of Experience  Salary
0   32    Male      Bachelor's  Software Engineer                  5.0   90000
1   28  Female        Master's       Data Analyst                  3.0   65000
2   45    Male             PhD     Senior Manager                 15.0  150000
3   36  Female      Bachelor's    Sales Associate                  7.0   60000
4   52    Male        Master's           Director                 20.0  200000


There is also a tail() method for viewing the last rows of the DataFrame.

The tail() method returns the headers and a specified number of rows, starting from the bottom.


In [61]:
# Print the last 5 rows of the DataFrame:
print(df.tail()) 

     Age  Gender Education Level                      Job Title  \
368   35  Female      Bachelor's       Senior Marketing Analyst   
369   43    Male        Master's         Director of Operations   
370   29  Female      Bachelor's         Junior Project Manager   
371   34    Male      Bachelor's  Senior Operations Coordinator   
372   44  Female             PhD        Senior Business Analyst   

     Years of Experience  Salary  
368                  8.0   85000  
369                 19.0  170000  
370                  2.0   40000  
371                  7.0   90000  
372                 15.0  150000  
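
Both methods accept the number of rows as an argument. A minimal sketch on a small, made-up DataFrame, so it runs without the CSV file:

```python
import pandas as pd

# Hypothetical DataFrame with 10 rows.
df = pd.DataFrame({"value": range(10)})

first_three = df.head(3)   # the first 3 rows
last_three = df.tail(3)    # the last 3 rows
default_head = df.head()   # 5 rows when no number is given

print(first_three)
print(last_three)
```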


### Info About the Data
The DataFrame object has a method called info(), which gives you more information about the data set.

In [62]:
# Print information about the data.
# info() prints its report itself and returns None, so it is not wrapped in print().
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373 entries, 0 to 372
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  373 non-null    int64  
 1   Gender               373 non-null    object 
 2   Education Level      373 non-null    object 
 3   Job Title            373 non-null    object 
 4   Years of Experience  373 non-null    float64
 5   Salary               373 non-null    int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 17.6+ KB


### Null Values
The info() method also tells us how many Non-Null values are present in each column. In the output above, every column reports 373 Non-Null values out of 373 entries, so this data set has no empty values.

If a column reported fewer Non-Null values than there are entries, the difference would be the number of rows with no value in that column.

Empty values, or Null values, can be bad when analyzing data, and you should consider removing rows with empty values. This is a step towards what is called cleaning data, and you will learn more about that in the next chapters.
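
To count empty values directly instead of reading them off the info() report, you can combine isna() and sum(). A minimal sketch on a small, made-up DataFrame with one missing value:

```python
import pandas as pd

# Hypothetical data; None marks a missing value.
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.0, None, 300.0],
})

# Number of empty values in each column.
missing_per_column = df.isna().sum()

# A copy of the data without the rows that contain empty values.
cleaned = df.dropna()

print(missing_per_column)
print(cleaned)
```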

