# Pandas Introduction


Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data. Pandas allows us to analyze big data and make conclusions based on statistical theories. Pandas can clean messy data sets, and make them readable and relevant.

# Import Pandas

In [1]:
import pandas as pd

In [2]:
import pandas as pd
print(pd.__version__)

1.1.3


# Pandas Series

A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any type.

In [3]:
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)

0    1
1    7
2    2
dtype: int64


# Labels

If nothing else is specified, the values are labeled with their index number. First value has index 0, second value has index 1 etc.
This label can be used to access a specified value.

In [4]:
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)

x    1
y    7
z    2
dtype: int64


# Key/Value Objects as Series

You can also use a key/value object, like a dictionary, when creating a Series.

In [12]:
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)

day1    420
day2    380
day3    390
dtype: int64


To select only some of the items in the dictionary, use the index argument and specify only the items you want to include in the Series.

In [14]:
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)

day1    420
day2    380
dtype: int64


# DataFrames

Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
Series is like a column, a DataFrame is the whole table.

In [5]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)

   calories  duration
0       420        50
1       380        40
2       390        45


# Locate Row

As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas use the loc attribute to return one or more specified row(s)

In [6]:
print(df.loc[[0, 1]])

   calories  duration
0       420        50
1       380        40


# Named Indexes

With the index argument, you can name your own indexes.

In [15]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)

      calories  duration
day1       420        50
day2       380        40
day3       390        45


# Locate Named Indexes

Use the named index in the loc attribute to return the specified row(s).

In [16]:
print(df.loc["day2"])

calories    380
duration     40
Name: day2, dtype: int64


# Load Files Into a DataFrame

If your data sets are stored in a file, Pandas can load them into a DataFrame.

In [18]:
df = pd.read_csv('cleveland Heart disease.csv') 
df.head(3)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,class
0,63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
2,67,1,4,120,229,0,2,129,1,2.6,2,2,7,1


# Pandas Read CSV

A simple way to store big data sets is to use CSV files (comma separated files).
CSV files contains plain text and is a well know format that can be read by everyone including Pandas.

# max_rows

You can check your system's maximum rows with the pd.options.display.max_rows statement.

In [8]:
print(pd.options.display.max_rows) 

60


# Dictionary as JSON

In [22]:
import pandas as pd
data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,  
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
  }
}
df = pd.DataFrame(data)
print(df) 

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340


# Pandas - Analyzing DataFrames

One of the most used method for getting a quick overview of the DataFrame, is the head() method.
The head() method returns the headers and a specified number of rows, starting from the top.

In [19]:
df = pd.read_csv('cleveland Heart disease.csv')
print(df.head(2))

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   1       145   233    1        2      150      0      2.3      3   
1   67    1   4       160   286    0        2      108      1      1.5      2   

   ca  thal  class  
0   0     6      0  
1   3     3      2  


There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting from the bottom.

In [20]:
print(df.tail(2))

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
301   57    0   2       130   236    0        2      174      0      0.0   
302   38    1   3       138   175    0        0      173      0      0.0   

     slope  ca  thal  class  
301      2   1     3      1  
302      1   1     3      0  


# Info About the Data

In [11]:
print(df.info()) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  class     303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
None
