# Pandas

Pandas is a Python library for working with data sets. It provides functions for analyzing, cleaning, exploring, and manipulating data.

Home page: https://pandas.pydata.org/

Getting started: https://pandas.pydata.org/docs/getting_started/index.html

Install with `pip install pandas` or `conda install pandas` (pandas comes preinstalled with Anaconda).

![01_table_dataframe.svg](attachment:01_table_dataframe.svg)

## DataFrame from Series
Each column in a DataFrame is a Series.
Selecting a single column of a pandas DataFrame returns a pandas Series. To select a column, put its label between square brackets `[]`.
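
As a small illustration of the point above, the sketch below selects one column with a single label and then with a list of labels. The frame `df_demo` and its values are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical two-column frame, used only for this illustration
df_demo = pd.DataFrame({"Courses": ["Python", "R"], "Fee": [22000, 25000]})

fee = df_demo["Fee"]          # a single label returns a Series
fee_frame = df_demo[["Fee"]]  # a list of labels returns a DataFrame

print(type(fee).__name__)        # Series
print(type(fee_frame).__name__)  # DataFrame
```

The Series keeps the column label as its `name`, which is why named Series turn into named columns when they are concatenated.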

In [4]:
import pandas as pd
# Create pandas Series
courses = pd.Series(["Python","R","Hadoop"])
fees = pd.Series([22000,25000,23000])
discount  = pd.Series([1000,2300,1000])
# Combine two series.
df=pd.concat([courses,fees],axis=1)

In [5]:
print(df)

        0      1
0  Python  22000
1       R  25000
2  Hadoop  23000


In [6]:
#combine multiple series.
df=pd.concat([courses,fees,discount],axis=1)
print(df)

        0      1     2
0  Python  22000  1000
1       R  25000  2300
2  Hadoop  23000  1000


### Names of Columns

In [7]:
courses = pd.Series(["Python","R","Hadoop"], name='Courses')
fees = pd.Series([22000,25000,23000], name='fees')
discount  = pd.Series([1000,2300,1000], name='discount')
df1=pd.concat([courses,fees,discount],axis=1)
print(df1)

  Courses   fees  discount
0  Python  22000      1000
1       R  25000      2300
2  Hadoop  23000      1000


In [9]:
# create index values for each row
indexColumn = ['C1','C2','C3']
df1.index = indexColumn
print(df1)

   Courses   fees  discount
C1  Python  22000      1000
C2       R  25000      2300
C3  Hadoop  23000      1000


In [11]:
# Index value to Column 
df1.reset_index()

  index Courses   fees  discount
0    C1  Python  22000      1000
1    C2       R  25000      2300
2    C3  Hadoop  23000      1000


## Add Empty Column to DataFrame

In [None]:
import numpy as np  # needed for np.nan

# Add empty columns to the DataFrame
df["Blank_Column"] = " "
df["NaN_Column"] = np.nan
df["None_Column"] = None

In [None]:
# Add empty columns using the assign() method (returns a new DataFrame)
df2 = df.assign(Blank_Column=" ", NaN_Column = np.nan, None_Column=None)

In [None]:
# Add multiple columns filled with NaN, using the columns parameter
df2 = df.reindex(columns = df.columns.tolist() + ["None_Column", "None_Column_2"])
print(df2)

In [None]:
# Add multiple columns filled with NaN, using the axis parameter
df2 = df.reindex(df.columns.tolist() + ["None_Column", "None_Column_2"], axis=1)
print(df2)

In [None]:
# Using insert(), add an empty column at the first position
# (insert() raises ValueError if "Blank_Column" already exists in df)
df.insert(0, "Blank_Column", " ")
print(df)
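
The three kinds of "empty" column above do not behave the same way. The sketch below, built on a hypothetical frame `df_demo`, suggests that a blank string is not treated as missing, while `np.nan` and `None` are.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, used only for this illustration
df_demo = pd.DataFrame({"Courses": ["Python", "R", "Hadoop"]})
df_demo["Blank_Column"] = " "     # a real string value, not a missing value
df_demo["NaN_Column"] = np.nan    # float column of missing values
df_demo["None_Column"] = None     # missing values as well

# Count the missing values per column
missing = df_demo.isna().sum()
print(missing)
```

If the column is meant to be filled in later with `fillna()`, `np.nan` or `None` is probably the better placeholder, since `fillna()` will not touch a blank string.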

## Replace Missing Values

`DataFrame.fillna()` replaces NaN/None with a value you specify.
`DataFrame.replace()` does a find and replace. Given `np.nan` as the value to find, it replaces every NaN with the value you specify.
NaN stands for Not a Number and is one of the common ways to represent a missing value in data. Sometimes None is also used.
`numpy.nan` specifies a NaN value. NaN is a float.
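
A minimal sketch of these points, assuming a hypothetical Series `s` with one missing fee: it checks that NaN is a float, that `None` in numeric data is stored as NaN, and that `fillna()` and `replace()` give the same result here.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing fee
s = pd.Series([20000, None, 26000], name="Fee")

print(type(np.nan).__name__)  # float
print(s.dtype)                # float64, because None was stored as NaN
print(s.isna().sum())         # 1

filled = s.fillna(0)             # replace missing values directly
replaced = s.replace(np.nan, 0)  # find NaN, replace with 0
print(filled.equals(replaced))   # True
```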

In [24]:
# Replace NaN with zero in all columns (returns a new DataFrame)
df.fillna(0)

   Courses    Fee Duration  Discount          Period  freq_count
0   Python  20000   30days      1000   Python-30days           1
1        R  25000   40days      1500        R-40days           1
2   Hadoop  26000   35days      2500   Hadoop-35days           1
3  Tableau  22000   40days      2100  Tableau-40days           1
4  PowerBI  24000   60days      2000  PowerBI-60days           1


In [None]:
# Replace in place (modifies df instead of returning a copy)
df.fillna(0,inplace=True)

In [None]:
# Replace on single column
df["Fee"] = df["Fee"].fillna(0)

In [None]:
# Replace on multiple columns
df[["Fee","Duration"]] = df[["Fee","Duration"]].fillna(0)

In [None]:
# Using replace()
df["Fee"] = df["Fee"].replace(np.nan, 0)

In [None]:
# Using replace()
df2 = df.replace(np.nan, 0)

## Combine Two Columns of Text

In [25]:
import pandas as pd
technologies = ({
     'Courses':["Python","R","Hadoop","Tableau","PowerBI"],
     'Fee' :[20000,25000,26000,22000,24000],
     'Duration':['30days','40days','35days','40days','60days'],
     'Discount':[1000,1500,2500,2100,2000]
               })
df = pd.DataFrame(technologies)
print(df)

   Courses    Fee Duration  Discount
0   Python  20000   30days      1000
1        R  25000   40days      1500
2   Hadoop  26000   35days      2500
3  Tableau  22000   40days      2100
4  PowerBI  24000   60days      2000


In [26]:
# Using the + operator to combine two columns
df["Courses"].astype(str) + "-" + df["Duration"]

0     Python-30days
1          R-40days
2     Hadoop-35days
3    Tableau-40days
4    PowerBI-60days
dtype: object

In [27]:
# Using apply() method to combine two columns of text
#df["Period"] = df[["Courses", "Duration"]].apply("-".join, axis=1)
df[["Courses", "Duration"]].apply("-".join, axis=1)

0     Python-30days
1          R-40days
2     Hadoop-35days
3    Tableau-40days
4    PowerBI-60days
dtype: object

In [28]:
# Using DataFrame.agg() to combine two columns of text
#df["Period"] = df[['Courses', 'Duration']].agg('-'.join, axis=1)
df[['Courses', 'Duration']].agg('-'.join, axis=1)

0     Python-30days
1          R-40days
2     Hadoop-35days
3    Tableau-40days
4    PowerBI-60days
dtype: object

In [29]:
# Using Series.str.cat() function
#df["Period"] = df["Courses"].str.cat(df["Duration"], sep="-")
df["Courses"].str.cat(df["Duration"], sep="-")

0     Python-30days
1          R-40days
2     Hadoop-35days
3    Tableau-40days
4    PowerBI-60days
Name: Courses, dtype: object

In [30]:
# Using DataFrame.apply() and lambda function
#df["Period"] = df[["Courses", "Duration"]].apply(lambda x: "-".join(x), axis =1)
df[["Courses", "Duration"]].apply(lambda x: "-".join(x), axis =1)

0     Python-30days
1          R-40days
2     Hadoop-35days
3    Tableau-40days
4    PowerBI-60days
dtype: object

In [31]:
# Using map() function to combine two columns of text
#df["Period"] = df["Courses"].map(str) + "-" + df["Duration"]
df["Courses"].map(str) + "-" + df["Duration"]

0     Python-30days
1          R-40days
2     Hadoop-35days
3    Tableau-40days
4    PowerBI-60days
dtype: object
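
The examples above join two text columns. When one of the columns is numeric, it likely needs an explicit conversion first. The sketch below assumes a hypothetical frame `df_demo` and converts the numeric column with `astype(str)` before concatenating.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column
df_demo = pd.DataFrame({"Courses": ["Python", "R"], "Fee": [20000, 25000]})

# Convert the numeric column to text, then join with the + operator
combined = df_demo["Courses"] + "-" + df_demo["Fee"].astype(str)
print(combined.tolist())  # ['Python-20000', 'R-25000']
```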

## Frequency of a Value

In [32]:
#Using pandas way, Series.value_counts()
df['Courses'].value_counts()

Python     1
R          1
Hadoop     1
Tableau    1
PowerBI    1
Name: Courses, dtype: int64

In [33]:
# Attribute access works too. Keep the parentheses: without them,
# the expression returns the bound method instead of the counts.
df.Courses.value_counts()

Python     1
R          1
Hadoop     1
Tableau    1
PowerBI    1
Name: Courses, dtype: int64

In [34]:
# Using groupby() & count()
df.groupby('Courses').count()

         Fee  Duration  Discount
Courses                         
Hadoop     1         1         1
PowerBI    1         1         1
Python     1         1         1
R          1         1         1
Tableau    1         1         1


In [35]:
# Add frequency count as a new column to the DataFrame
#df['freq_count'] = df.groupby('Courses')['Courses'].transform('count')
df.groupby('Courses')['Courses'].transform('count')

0    1
1    1
2    1
3    1
4    1
Name: Courses, dtype: int64
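
Every course in `df` is unique, so all counts above are 1. The sketch below uses a hypothetical frame `df_demo` with a repeated value to show what `transform('count')` produces when values repeat.

```python
import pandas as pd

# Hypothetical frame in which "Python" appears twice
df_demo = pd.DataFrame({"Courses": ["Python", "R", "Python"]})

# transform() returns one value per original row, aligned to the index
df_demo["freq_count"] = df_demo.groupby("Courses")["Courses"].transform("count")
print(df_demo["freq_count"].tolist())  # [2, 1, 2]
```

Because the result has the same length and index as the original frame, it can be assigned straight to a new column.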

In [36]:
# Getting value counts of multiple columns
df.groupby(['Courses', 'Fee']).count()

               Duration  Discount
Courses Fee                      
Hadoop  26000         1         1
PowerBI 24000         1         1
Python  20000         1         1
R       25000         1         1
Tableau 22000         1         1


In [37]:
# Get occurrences of each value in the index (row labels)
df.index.value_counts()

0    1
1    1
2    1
3    1
4    1
dtype: int64

In [38]:
# Include NaN, None, Null values in the count.
df['Courses'].value_counts(dropna=False)

Python     1
R          1
Hadoop     1
Tableau    1
PowerBI    1
Name: Courses, dtype: int64
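
Since `df['Courses']` has no missing values, `dropna=False` makes no visible difference above. The sketch below uses a hypothetical Series `courses` that contains one NaN to show the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a repeated value and one missing value
courses = pd.Series(["Python", "R", "Python", np.nan], name="Courses")

default_counts = courses.value_counts()            # NaN is excluded
with_missing = courses.value_counts(dropna=False)  # NaN gets its own row

print(default_counts["Python"])  # 2
print(len(default_counts))       # 2
print(len(with_missing))         # 3
print(with_missing.sum())        # 4
```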