What is Pandas?

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.
Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets and make them readable and relevant.

Relevant data is very important in data science.

pandas

Primary Purpose:

pandas is designed for data manipulation and analysis, particularly for working with structured data.
It provides tools for data cleaning, preparation, and wrangling.

Data Structures:

The core data structures in pandas are Series (1-dimensional) and DataFrame (2-dimensional).
It supports heterogeneous data types (different columns in a DataFrame can be of different types).
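
As a small illustration of the two structures, the sketch below builds a Series and a DataFrame whose columns have different types. The names `prices` and `inventory` and the values in them are made up for this example.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series([9.5, 12.0, 7.25], name="price")

# A DataFrame is a two-dimensional table; each column is a Series
# and may have its own dtype.
inventory = pd.DataFrame({
    "item": ["pen", "book", "cup"],    # strings
    "price": prices,                   # floats
    "in_stock": [True, False, True],   # booleans
})

print(inventory.dtypes)

# Selecting a single column returns a Series.
column = inventory["price"]
print(type(column).__name__)
```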

Functionality:

pandas offers a wide range of functionalities for data manipulation, including data alignment, merging, reshaping, grouping, and time-series analysis.
It has powerful indexing and selection capabilities, which make complex data manipulation tasks practical.
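
A minimal sketch of data alignment and label-based selection, using two made-up Series (`q1` and `q2` are hypothetical names). It shows that arithmetic matches values by index label rather than by position.

```python
import pandas as pd

q1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
q2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Arithmetic aligns on index labels, not on position.
# Labels present in only one Series produce NaN.
total = q1 + q2
print(total)

# Label-based selection with loc, then selection with a boolean mask.
print(total.loc["b"])
print(total[total > 25])
```

Labels "a" and "d" appear in only one of the inputs, so their sums are missing values rather than errors.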

Use Case:

pandas is widely used in data analysis tasks such as data cleaning, transformation, and visualization.
It is particularly useful for handling missing data, filtering data, and performing operations on groups of data.
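
The three tasks named above can be sketched on a tiny, made-up table. The `sales` frame and the choice of 0 as a fill value are assumptions for illustration; the right fill value depends on the data.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, np.nan, 7, 3],
})

# Handle missing data: here an unknown unit count is treated as 0.
sales["units"] = sales["units"].fillna(0)

# Filter rows with a boolean condition.
large = sales[sales["units"] >= 5]
print(large)

# Operate on groups: total units per region.
per_region = sales.groupby("region")["units"].sum()
print(per_region)
```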

In [None]:
import pandas as pd

In [None]:
pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
pd.Series([1, 2, 3, 4, 5])
pd.read_csv('filename.csv')
df.to_csv('filename.csv')
pd.read_excel('filename.xlsx')
df.to_excel('filename.xlsx')
df.head(n)
df.tail(n)
df.dtypes
df.describe()

# Select column
df['column_name']

# Select multiple columns
df[['col1', 'col2']]

# Select rows by position
df.iloc[5:10]

# Select rows by index label
df.loc['index_one':'index_five']

# Adding a new column
df['new_col'] = df['col1'] + df['col2']

# Deleting columns:
df.drop('column_name', axis=1, inplace=True)

df.dropna()

df.fillna(value='fill_value')

df.isna()

df.groupby('col_name').sum()
df.groupby('col_name').agg({'col1': 'sum', 'col2': 'mean'})

# Concatenate vertically
pd.concat([df1, df2])

# Concatenate horizontally
pd.concat([df1, df2], axis=1)

# SQL-style joins
pd.merge(df1, df2, on='key_column', how='left')

# Pivot tables
df.pivot_table(values='D', index=['A', 'B'], columns=['C'])

df.stack()
df.unstack()

df.sort_index()
df.sort_values(by='column')
df.rank()

data['Rank'] = data['Weighted Score'].rank(method='dense', ascending=False)
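
The reference cell above cannot run on its own because `df`, `df1`, `df2` and `data` are not defined in it. The following self-contained sketch exercises three of the listed calls (`pd.merge`, `pivot_table`, and `rank` with `method='dense'`) on made-up `orders` and `customers` tables.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [50, 20, 30, 50],
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "city": ["Oslo", "Bergen"],
})

# Left join: every order is kept; customer 3 has no match, so city is NaN.
joined = pd.merge(orders, customers, on="customer_id", how="left")
print(joined)

# Pivot table: total amount per city (without aggfunc the mean is used).
by_city = joined.pivot_table(values="amount", index="city", aggfunc="sum")
print(by_city)

# Dense ranking: ties share a rank and no rank numbers are skipped.
orders["rank"] = orders["amount"].rank(method="dense", ascending=False)
print(orders)
```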

In [None]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df['A'] + df['B'])

0    5
1    7
2    9
dtype: int64


In [None]:
df.drop('A', axis=1, inplace=True) # delete column 'A'
print(df)

df.drop(0, inplace=True) # delete row 0
print(df)

df.isna()

   B
0  4
1  5
2  6
   B
1  5
2  6


In [None]:
print(df)
print(df.stack())
print(df.unstack())

   A  B
0  1  4
1  2  5
2  3  6
0  A    1
   B    4
1  A    2
   B    5
2  A    3
   B    6
dtype: int64
A  0    1
   1    2
   2    3
B  0    4
   1    5
   2    6
dtype: int64


In [None]:
df.sort_index()

   A  B
0  1  4
1  2  5
2  3  6


In [None]:
# A Pandas Series is like a column in a table.

# It is a one-dimensional array holding data of any type.

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

print(myvar[0])

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

print(myvar["y"])

0    1
1    7
2    2
dtype: int64
1
x    1
y    7
z    2
dtype: int64
7


In [None]:
calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)

day1    420
day2    380
day3    390
dtype: int64
day1    420
day2    380
dtype: int64


What is a DataFrame?

A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [None]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

df = pd.DataFrame(mydataset)

print(df)

# Refer to a row by its index label:
print(df.loc[0])

# Use a list of labels to return rows 0 and 1:
print(df.loc[[0, 1]])

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2
cars        BMW
passings      3
Name: 0, dtype: object
    cars  passings
0    BMW         3
1  Volvo         7


In [None]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)

print(df.loc["day2"])


      calories  duration
day1       420        50
day2       380        40
day3       390        45
calories    380
duration     40
Name: day2, dtype: int64


In [None]:
df = pd.read_csv('data.csv')

print(df)

# use to_string() to print the entire DataFrame.
print(df.to_string())

# Check the number of maximum returned rows:
print(pd.options.display.max_rows)

# On my system the value is 60: if the DataFrame has more than 60 rows, print(df) shows only the headers and the first and last 5 rows.

# You can change the maximum number of rows by assigning to the same option.
pd.options.display.max_rows = 9999

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'
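
The cell above fails when data.csv is not in the working directory. As a sketch that runs without any file, the example below reads made-up CSV text from memory; `csv_text` and its values are placeholders, not the contents of the real data.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv.
csv_text = """Duration,Pulse,Calories
60,110,409.1
45,109,282.4
60,103,
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.to_string())   # to_string() renders every row
print(df.shape)
```

The last row has no Calories value, so pandas stores it as NaN, which is the kind of empty cell the cleaning steps later in this notebook deal with.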

In [None]:
df = pd.read_json('data.json')

print(df.to_string())
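
If data.json is not available, the same call can be tried on JSON text held in memory. This is only a sketch: `json_text` is made-up data in the column-oriented layout that `read_json` expects by default.

```python
import io

import pandas as pd

# Hypothetical JSON content: one object per column, keyed by row label.
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 109}}'

df = pd.read_json(io.StringIO(json_text))

print(df.to_string())
```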

In [None]:
data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}

df = pd.DataFrame(data)

print(df)

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300


In [None]:
print(df.head(10))
print(df.tail())

Cleaning Data

In [None]:
# Return a new DataFrame without the rows that contain empty cells:
new_df = df.dropna()

print(new_df.to_string())

# To change the original DataFrame instead, use the inplace = True argument:
df.dropna(inplace = True)

print(df.to_string())

# Replace empty (NaN) values in the whole DataFrame with the number 130:
df.fillna(130, inplace = True)

# Replace empty values in the "Calories" column only with the number 130.
# Assign the result back: calling fillna(inplace = True) on a selected column
# may not update df in recent pandas versions.
df["Calories"] = df["Calories"].fillna(130)

# Calculate a replacement value and fill empty cells with it.
# Pick ONE of the following: mean, median, or mode.
x = df["Calories"].mean()
# x = df["Calories"].median()
# x = df["Calories"].mode()[0]

df["Calories"] = df["Calories"].fillna(x)
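
The cleaning calls above are alternatives rather than a sequence: once rows with empty cells are dropped, nothing is left to fill. The sketch below contrasts the two approaches on a small made-up frame (`raw` is a hypothetical name) so the effect of each is visible.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "Duration": [60, 45, 60, 45],
    "Calories": [409.0, np.nan, 340.0, 300.0],
})

# Option 1: drop every row that has at least one empty cell.
dropped = raw.dropna()
print(dropped)

# Option 2: keep all rows and fill the gap with the column mean.
# mean() skips NaN values by default.
mean_calories = raw["Calories"].mean()
filled = raw.copy()
filled["Calories"] = filled["Calories"].fillna(mean_calories)
print(filled)
```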

In [None]:
# Convert all cells in the 'Date' column into dates:
df['Date'] = pd.to_datetime(df['Date'])

# Set "Duration" = 45 in row 7:
df.loc[7, 'Duration'] = 45

# Option 1: cap every "Duration" above 120 at 120:
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.loc[x, "Duration"] = 120

# Option 2 (instead of capping): remove rows where "Duration" is above 120:
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True)

# Returns True for every row that is a duplicate, otherwise False:
print(df.duplicated())
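
The row-by-row loops above can usually be replaced by vectorized calls. The following is a sketch on made-up data (`workouts` is a hypothetical name), using `clip`, a boolean mask, and `drop_duplicates`.

```python
import pandas as pd

workouts = pd.DataFrame({
    "Duration": [60, 450, 45, 60],
    "Pulse": [110, 104, 109, 110],
})

# Cap values: clip(upper=120) replaces anything above 120 with 120.
capped = workouts.assign(Duration=workouts["Duration"].clip(upper=120))
print(capped)

# Or remove the offending rows with a boolean mask.
filtered = workouts[workouts["Duration"] <= 120]
print(filtered)

# duplicated() marks rows that repeat an earlier row;
# drop_duplicates() removes them.
print(workouts.duplicated().tolist())
deduplicated = workouts.drop_duplicates()
print(deduplicated)
```

Each call returns a new object, so `workouts` itself is left unchanged.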


In [None]:
# Show the relationship between the numeric columns
# (numeric_only skips non-numeric columns such as 'Date'):
df.corr(numeric_only = True)
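
To make the result of `corr()` easier to read, here is a sketch on made-up numbers in which Calories is exactly proportional to Duration, so their correlation comes out as 1. The frame `sample` and its values are assumptions for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [200, 300, 400, 600],   # exactly proportional to Duration
    "Pulse": [120, 110, 115, 105],
})

# corr() returns a square table of pairwise Pearson correlations.
corr = sample.corr()
print(corr)
```

Values near 1 or -1 indicate a strong linear relationship, values near 0 a weak one, and every column correlates perfectly with itself.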

In [None]:
# Pandas uses the plot() method to create diagrams.
# plt.show() below needs Matplotlib's pyplot module:
import matplotlib.pyplot as plt

df.plot()

df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')

df["Duration"].plot(kind = 'hist')

plt.show()