# **Pandas Introduction**
**What is Pandas?**

Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.

# Why Use Pandas?
Pandas lets us analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

The source code for Pandas is hosted in this GitHub repository: https://github.com/pandas-dev/pandas

In [5]:
# Import Pandas
import pandas as pd
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


# **Pandas Series**

**What is a Series?**

A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any type.

In [6]:
import pandas as pd
a = [1, 7, 2]
list1 = pd.Series(a)
print(list1)

0    1
1    7
2    2
dtype: int64
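
The `dtype` line in the output reflects what the Series holds. As a small sketch of "data of any type", the example below builds three Series from different kinds of values and prints the dtype pandas infers for each; the variable names and sample values are illustrative only.

```python
import pandas as pd

# Integers are stored with an integer dtype.
ints = pd.Series([1, 7, 2])

# Floats are stored with a floating-point dtype.
floats = pd.Series([1.5, 7.0, 2.25])

# Mixed Python objects fall back to the generic "object" dtype.
mixed = pd.Series(["BMW", 7, 2.5])

print(ints.dtype, floats.dtype, mixed.dtype)
```

A Series with the `object` dtype still works, but numeric operations on it are usually slower than on a purely numeric Series.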


# Labels
If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second has index 1, and so on.

This label can be used to access a specified value.

In [7]:
print(list1[0])

1


In [8]:
# With the index argument, you can name your own labels.
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)

x    1
y    7
z    2
dtype: int64


In [9]:
print(myvar["y"])

7
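
Square brackets work for both default and custom labels. As a hedged aside, pandas also provides the explicit accessors `.loc` (by label) and `.iloc` (by integer position); `.iloc` is not used elsewhere in this notebook and is introduced here only for comparison.

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = myvar.loc["y"]      # look up by label
by_position = myvar.iloc[1]    # look up by 0-based integer position

print(by_label, by_position)
```

Using the explicit accessors makes it clear whether a label or a position is meant, which matters once the labels are no longer the default 0, 1, 2.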


# Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.

In [10]:
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}
list1 = pd.Series(calories)
print(list1)

day1    420
day2    380
day3    390
dtype: int64


In [12]:
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}
list1 = pd.Series(calories, index = ["day1", "day2"])
print(list1)

day1    420
day2    380
dtype: int64
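
When `index` is passed together with a dictionary, only the listed keys end up in the Series. The sketch below shows what happens when a requested label is not a key in the dictionary; `"day4"` is a hypothetical label added for illustration.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is a hypothetical label that is not a key in the dictionary.
partial = pd.Series(calories, index=["day1", "day4"])
print(partial)
```

The missing label is filled with `NaN`, and because `NaN` is a floating-point value the Series is stored as `float64` rather than `int64`.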


# **DataFrames**
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

In [14]:
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
    "lose_cal": [10, 20, 30]
}

df = pd.DataFrame(data)
print(df)

   calories  duration  lose_cal
0       420        50        10
1       380        40        20
2       390        45        30
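
To make the column/table relationship concrete, the sketch below pulls one column out of the same DataFrame and checks its type. Selecting a column with square brackets is standard pandas, though it is not shown elsewhere in this notebook.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
    "lose_cal": [10, 20, 30]
}
df = pd.DataFrame(data)

calories = df["calories"]          # select a single column by name
print(type(calories).__name__)     # class name of the selected column
print(calories)
```

The selected column keeps the row labels of the DataFrame as its own index.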


**Locate Row**

As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the `loc` attribute to return one or more specified rows.

In [15]:
print(df.loc[0])

calories    420
duration     50
lose_cal     10
Name: 0, dtype: int64


In [16]:
print(df.loc[[0, 1]])

   calories  duration  lose_cal
0       420        50        10
1       380        40        20
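
Note that `df.loc[0]` returns a single row as a Series, while `df.loc[[0, 1]]` returns a DataFrame. `loc` works on labels, so it behaves the same way with named rows. The sketch below assumes hypothetical row labels `day1` to `day3` passed through the `index` argument of `pd.DataFrame`.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
    "lose_cal": [10, 20, 30]
}
# The row labels are hypothetical names chosen for illustration.
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

one_row = df.loc["day2"]             # a single row, returned as a Series
some_rows = df.loc["day1":"day2"]    # a label slice includes both endpoints

print(one_row)
print(some_rows)
```

Unlike ordinary Python slices, a label slice with `loc` includes the end label, so `some_rows` holds both `day1` and `day2`.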


# **Load Files Into a DataFrame**
If your data sets are stored in a file, Pandas can load them into a DataFrame.

In [None]:
import pandas as pd

# Uncomment the reader that matches your file format:
# df = pd.read_csv('data.csv')
# df = pd.read_json('data.json')
# df = pd.read_excel('data.xlsx')
print(df)
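
The readers above need a file on disk. As a self-contained sketch, `pd.read_csv` also accepts a file-like object, so a small in-memory CSV can stand in for a real file. The text below is an assumption made for illustration: the first two rows mirror `data.csv`, and the third row is made up to show an empty field.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for a file on disk (illustrative values).
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,109,175,
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The empty `Calories` field in the last row is read as `NaN`, which is how pandas marks missing values.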

# **Read CSV Files**
A simple way to store large data sets is to use CSV files (comma-separated values).

CSV files contain plain text in a well-known format that can be read by almost any tool, including Pandas.

In our examples we will be using a CSV file called 'data.csv'.

In [18]:
import pandas as pd

df = pd.read_csv('/content/data.csv')
print(df.to_string()) # use to_string() to print the entire DataFrame.

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

In [20]:
import pandas as pd

df = pd.read_csv('/content/data.csv')  # Without to_string method
print(df)

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]
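
The second printout is shortened because pandas limits how many rows it prints; the limit is stored in `pd.options.display.max_rows` (typically 60). The sketch below uses a placeholder DataFrame with 169 rows, not the real `data.csv`, to show a few ways of controlling how much is displayed.

```python
import pandas as pd

# Build a frame with 169 rows of placeholder values (not the real data.csv).
df = pd.DataFrame({"Duration": [60] * 169, "Pulse": range(169)})

print(pd.options.display.max_rows)   # current row limit for printing
print(df.head(3))                    # first three rows only
print(df.tail(3))                    # last three rows only

# Temporarily lift the limit so that str(df) contains every row.
with pd.option_context("display.max_rows", None):
    full_text = str(df)
```

`head()` and `tail()` are usually enough for a quick look, while `to_string()` or a raised `display.max_rows` is better suited to small data sets that should be shown in full.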
