## Pandas Library: **Data cleaning** and **data manipulation** for data analysis
- Pandas is a powerful Python library for data manipulation, widely used for data analysis and data cleaning.
- It provides two primary data structures: **Series** and **DataFrame**.
- A Series is a one-dimensional, labelled, array-like object.
- A DataFrame is a two-dimensional, size-mutable, **potentially heterogeneous tabular data structure with labelled axes (rows and columns)**.

Things to see:
  - Pandas - DataFrames and Series 
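The terms "heterogeneous" and "size-mutable" can be made concrete with a very small DataFrame. The sketch below uses made-up column names and values (they are not part of the dataset used later) and only shows that each column keeps its own dtype, that a single column is a Series, and that columns can be added after creation.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5], "count": [1, 2]})

# Heterogeneous: each column keeps its own dtype
print(df.dtypes)

# Each column of a DataFrame is itself a Series
column = df["score"]
print(type(column))

# Size-mutable: a column can be added after the DataFrame is created
df["passed"] = [True, False]
print(df.shape)
```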

In [None]:
# !pip install numpy
# !pip install pandas




In [None]:
import pandas as pd
import numpy as np

# Create a Series: a 1-D array-like object that can hold any data type. It is similar to a column in a table.
data=[1,2,3,4,5,6] # a plain Python list

# Create a Series from the list using the pd.Series constructor
series = pd.Series(data)
print("Series:\n",series)
# The Series looks like a column in a table:
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# 5    6
# dtype: int64
# 0 1 2 3 4 5 are the default index labels
print("dtype:", type(series)) # note: this prints the Python type of the object; series.dtype gives the element dtype
# <class 'pandas.core.series.Series'>
# A Series can be passed directly to NumPy functions

np_array = np.array(series*2)
print("np_array:", np_array)
print("median:",np.median(np_array))
print("std.deviation:", np.std(np_array))
print("variance",np.var(np_array))
print("Mean",np.mean(np_array))
print("average",np.average(np_array))

# Let's create a Series from a dictionary
obj1 = {"name": "Nitro", "age":25 , "height": 123.5}
obj2 = {"a": 1, "b":2 , "c": 3}

series1 = pd.Series(obj1)
series2 = pd.Series(obj2)
print(series1)
print(type(series1))
print(series2)
print(type(series2))

# For a list, pandas generates the index automatically (0, 1, 2, ...).
# For a dictionary, the keys become the index labels.

# To use our own index labels for a list, pass them through the index argument:
data = [10,20,30,40]
index=["a","b","c","d"]

series_list = pd.Series(data, index=index)
print(series_list)


Series:
 0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
dtype: <class 'pandas.core.series.Series'>
np_array: [ 2  4  6  8 10 12]
median: 7.0
std.deviation: 3.415650255319866
variance 11.666666666666666
Mean 7.0
average 7.0
name      Nitro
age          25
height    123.5
dtype: object
<class 'pandas.core.series.Series'>
a    1
b    2
c    3
dtype: int64
<class 'pandas.core.series.Series'>
a    10
b    20
c    30
d    40
dtype: int64
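Once a Series has custom index labels, those labels can be used for lookups, and arithmetic between two Series is matched by label rather than by position. The following is a minimal sketch; the names `prices` and `other` are made up for illustration.

```python
import pandas as pd

prices = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Look up a single value by its label
print(prices["b"])

# Select several labels at once (returns a Series)
print(prices[["a", "d"]])

# Arithmetic aligns on labels; labels missing on one side produce NaN
other = pd.Series([1, 2], index=["a", "b"])
total = prices + other
print(total)
```

Because of this alignment, two Series with different index labels do not raise an error when added; the unmatched labels simply become missing values.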


In [18]:
# A DataFrame is a 2-D, size-mutable, potentially heterogeneous tabular data structure with labelled axes.
# The main difference between a Series and a DataFrame is the number of columns:
# a Series has a single column, whereas a DataFrame can have as many labelled columns as we want.

# Create a DataFrame from a dictionary of lists

data = {
  "Name": ["Natarajan","Nitro","Thanaraj","Athistalakshmi"],
  "Age":[32,30,60,23],
  "City": ["Trichy","Ariyalur","Perambalur","Theni"]
}

df = pd.DataFrame(data)

print(type(df)) #  <class 'pandas.core.frame.DataFrame'>

# A DataFrame can be converted to a NumPy array; mixed column types give an array of dtype object
num_arr = np.array(df)
print("numpy_arr_from_dataframe\n", num_arr)
print(num_arr[...,1]) # second column (Age) of every row
# Create a DataFrame from a list of dictionaries as well

data1 = [{"Name":"Natarajan", "Age": 32 , "City": "Trichy"},
         {"Name":"Nitro", "Age": 30 , "City": "Thanjavur"},
         {"Name":"Thanaraj", "Age": 28 , "City": "Ariyalur"},
         {"Name":"Athista", "Age": 20 , "City": "Ariyalur"}
         ]
df1 = pd.DataFrame(data1)
print("df1: \n", df1)

# Select a single column by indexing the DataFrame with the column name.
# The result is a Series holding that column's value for every row.
names = df1["Name"]
age = df1["Age"]
display(names)
print("each column data type in data frame : ",type(age))

# In practice we usually work with larger datasets stored in files such as CSV or XLSX.
# pandas reads CSV files with pd.read_csv (Excel files are read with pd.read_excel).

# Read the CSV file with pd.read_csv("path")
df = pd.read_csv("./data_set/data.csv")

# to get the top 5 records
display(df.head(5))

# to get the last 5 records
display(df.tail(5))

<class 'pandas.core.frame.DataFrame'>
numpy_arr_from_dataframe
 [['Natarajan' 32 'Trichy']
 ['Nitro' 30 'Ariyalur']
 ['Thanaraj' 60 'Perambalur']
 ['Athistalakshmi' 23 'Theni']]
[32 30 60 23]
df1: 
         Name  Age       City
0  Natarajan   32     Trichy
1      Nitro   30  Thanjavur
2   Thanaraj   28   Ariyalur
3    Athista   20   Ariyalur


0    Natarajan
1        Nitro
2     Thanaraj
3      Athista
Name: Name, dtype: object

each column data type in data frame :  <class 'pandas.core.series.Series'>


Unnamed: 0,Date,Category,Value,Product,Sales,Region
0,2023-01-01,A,28.0,Product1,754.0,East
1,2023-01-02,B,39.0,Product3,110.0,North
2,2023-01-03,C,32.0,Product2,398.0,East
3,2023-01-04,B,8.0,Product1,522.0,East
4,2023-01-05,B,26.0,Product3,869.0,North


Unnamed: 0,Date,Category,Value,Product,Sales,Region
45,2023-02-15,B,99.0,Product2,599.0,West
46,2023-02-16,B,6.0,Product1,938.0,South
47,2023-02-17,B,69.0,Product3,143.0,West
48,2023-02-18,C,65.0,Product3,182.0,North
49,2023-02-19,C,11.0,Product3,708.0,North
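The `Unnamed: 0` column in the output above typically appears when a CSV was saved together with its index. The sketch below assumes that is the case here (the header of the first column is empty) and inlines three rows from the output so it runs without the file. It shows two optional `pd.read_csv` arguments: `index_col` and `parse_dates`.

```python
import io
import pandas as pd

# Three rows in the same layout as the output above, inlined so the
# example is self-contained. The empty first header is an assumption.
csv_text = """,Date,Category,Value,Product,Sales,Region
0,2023-01-01,A,28.0,Product1,754.0,East
1,2023-01-02,B,39.0,Product3,110.0,North
2,2023-01-03,C,32.0,Product2,398.0,East
"""

# index_col=0 uses the first column as the index instead of creating
# an "Unnamed: 0" column; parse_dates converts Date to a datetime dtype.
sales = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["Date"])

print(sales.columns.tolist())
print(sales.dtypes)
```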



Where to Use ```df.loc[]``` and ```df.iloc[]``` in Real-World Data Science Tasks?

Answer: 
  - Data Cleaning & Filtering: Extracting specific rows based on conditions.
  - Feature Selection: Selecting specific columns for machine learning models.
  - Data Preprocessing: Modifying data, filling missing values in certain rows.
  - Exploratory Data Analysis (EDA): Selecting subsets of data for visualization.
  - Data Transformation: Creating new calculated fields based on certain conditions.
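A few of the tasks listed above can be sketched on a tiny DataFrame. The data and the `Tier` column below are hypothetical and only illustrate the pattern `df.loc[row_condition, column] = value`.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame(
    {"Region": ["East", "North", "East", "West"],
     "Sales": [754.0, np.nan, 398.0, 599.0]}
)

# Data preprocessing: fill Sales only in the rows where it is missing
df.loc[df["Sales"].isna(), "Sales"] = 0.0

# Data transformation: create a calculated field based on a condition
df["Tier"] = "low"
df.loc[df["Sales"] > 500, "Tier"] = "high"

# Feature selection: keep only the columns a model would need
features = df.loc[:, ["Sales", "Tier"]]
print(features)
```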

In [None]:
# The .loc[] and .iloc[] indexers in pandas are used for selecting data from a DataFrame
# based on labels or integer positions.

# Here's where and how you can use them:

#          1. When Selecting Rows and Columns:
# .loc[] selects data by label (index and column names).
# .iloc[] selects data by integer position.
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Score': [85, 90, 95]}

df = pd.DataFrame(data, index=['a', 'b', 'c'])

# Using loc to select by index label
print(df.loc['a'])  # Select row with index label 'a'
print("Row with index label 'a':\n", df.loc['a'])  # same row, printed with a caption

# Using iloc to select by integer position
print(df.iloc[0])  # Select first row (position 0)
print("Element", df.iloc[0,2]) # single value at row position 0, column position 2 (Score)

# To display any single column we can simply use df["Name"]
print("Name column:\n", df["Name"], "\nName column data type:\n",type(df["Name"]) )


#         2. Filtering Data Based on Conditions:
# .loc[] is commonly used for filtering data based on conditions.

# Select all rows where Age > 25
df_filtered = df.loc[df['Age'] > 25]
display("age>25:", df_filtered)


#         3. Selecting Specific Columns:
# .loc[] can be used to select specific columns by name.
# .iloc[] can be used to select specific columns by position.

# Select 'Name' and 'Score' columns using loc
print(df.loc[:, ['Name', 'Score']])

# Select first two columns using iloc
print(df.iloc[:, [0, 1]])


#        4. Modifying Data:
# .loc[] is useful for updating values in specific rows and columns.

# Update Score for index 'b'
df.loc['b', 'Score'] = 95
print(df)


# 5. Slicing Rows and Columns:
# .iloc[] can be used for slicing by row and column positions.

# Select first two rows and first two columns
print(df.iloc[0:2, 0:2])


# 6. Selecting a Single Value (Scalar Selection):
# You can use .loc[] and .iloc[] to get a specific value.

# Get the value at row 'c' and column 'Score'
print(df.loc['c', 'Score'])

# Get the value at first row, second column
print(df.iloc[0, 1])


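# To read a single element, .at[] (labels) and .iat[] (positions) also work.
# This DataFrame uses the labels 'a', 'b', 'c', so integer labels would raise a KeyError.
# df.at['b',"Age"]
# x = df.loc['c','Name']
# print(x)


# Summary: use the indexers loc[], iloc[], at[] and iat[] to access elements as needed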
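One difference between the two indexers is easy to miss when slicing: a label slice with `.loc[]` includes its end label, while a position slice with `.iloc[]` excludes its stop position, like ordinary Python slicing. A minimal sketch on the same three-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35], "Score": [85, 90, 95]},
    index=["a", "b", "c"],
)

by_label = df.loc["a":"b"]   # label slice: includes both 'a' and 'b'
by_position = df.iloc[0:1]   # position slice: stop is excluded, only row 0

print(len(by_label))
print(len(by_position))
```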

In [33]:
import pandas as pd
df = pd.read_csv("data_set/data.csv")
# Access a row by its index label, and a single value by row label and column name
df.loc[0] # row with index label 0
df.loc[0, "Sales"] # Sales value of the row labelled 0

# Access a whole column by its name
df.loc[:,"Sales"]

# Select multiple columns
df[["Product","Date", "Category" , "Sales" ,"Value"]]

# Filtering & Conditional Selection
df.loc[df["Sales"] < 200] 
df.query("Sales > 700")

# Ways to extract specific rows based on conditions.

df[df['Sales'] > 800] # Filter rows
df.query("Sales > 300") # Query-based filtering
df[(df['Sales'] > 500) & (df['Value'] <1000)] # Multiple conditions
# df[df["Region"].isin(["East", "West"])] # Filter rows matching a list of values
# df[df['Product'].str.contains('Product1')] # Filter string-based values

Unnamed: 0,Date,Category,Value,Product,Sales,Region
0,2023-01-01,A,28.0,Product1,754.0,East
3,2023-01-04,B,8.0,Product1,522.0,East
4,2023-01-05,B,26.0,Product3,869.0,North
6,2023-01-07,A,16.0,Product1,936.0,East
8,2023-01-09,C,37.0,Product3,772.0,West
9,2023-01-10,A,22.0,Product2,834.0,West
10,2023-01-11,B,7.0,Product1,842.0,North
12,2023-01-13,A,70.0,Product3,628.0,South
14,2023-01-15,A,47.0,Product2,893.0,West
16,2023-01-17,C,93.0,Product2,511.0,South
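The list-based and string-based filters can be tried on a few rows copied from the output above, so that the example runs without the CSV file. The variable `threshold` is a made-up cut-off used to show how `query` can refer to a Python variable with `@`.

```python
import pandas as pd

# Rows 0, 3, 4 and 6 from the output above (subset of columns)
df = pd.DataFrame(
    {"Category": ["A", "B", "B", "A"],
     "Product": ["Product1", "Product1", "Product3", "Product1"],
     "Sales": [754.0, 522.0, 869.0, 936.0],
     "Region": ["East", "East", "North", "East"]}
)

# Keep rows whose Region is one of the listed values
in_regions = df[df["Region"].isin(["North", "West"])]

# Keep rows whose Product contains a substring
product1 = df[df["Product"].str.contains("Product1")]

# Refer to a Python variable inside query() with the @ prefix
threshold = 800  # hypothetical cut-off
above = df.query("Sales > @threshold")

print(in_regions)
print(product1)
print(above)
```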


In [None]:
import pandas as pd
df = pd.read_csv("data_set/data.csv")
# loc[]
'''
display(df.loc[0]) # it shows the 0th row data of dataframe
display(df.loc[1, "Sales"]) 
display(df.loc[(df["Category"]=="A") & (df["Region"]=="East")])
'''
# iloc
# display(df.iloc[0:,4])

# at[]: access a single element by row label and column name

print(df.at[4,"Category"]) # Category value of the row labelled 4

# iat[]: access a single element by integer row and column positions
print(df.iat[1,4])
print(df.loc[1].iat[1]) # position 1 within the row labelled 1
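With the default integer index, labels and positions happen to coincide, which hides the difference between `at[]` and `iat[]`. The sketch below uses a small DataFrame with string labels (made up for illustration) to make the distinction visible, and shows that both accessors can also assign a single value.

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 35], "Score": [85, 90, 95]}, index=["a", "b", "c"])

by_label = df.at["b", "Score"]   # row label and column name
by_position = df.iat[1, 1]       # row position and column position
print(by_label, by_position)

# Assign a single value through at[]
df.at["c", "Score"] = 99
print(df.loc["c", "Score"])
```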


In [None]:
# Data manipulation with a DataFrame
# Let's take an example DataFrame

import pandas as pd

data = {"Name": ["Natarajan","Nitro","Thanaraj","Athistalakshmi"],
  "Age":[32,30,60,23],
  "City": ["Trichy","Ariyalur","Perambalur","Theni"]}
  
df = pd.DataFrame(data)
print(df)

# Let's manipulate the DataFrame

# To add a new column, assign a list of values (one per row) to a new column name
df["Salary"] = [50000,80000,45000,45698]
df["Code"] = [12,56,45,85]

# Now let's remove a column

# df.drop("Code") 
# Running this raises KeyError: "['Code'] not found in axis"
# Why? By default axis=0, which refers to the row axis, so pandas looks for a row labelled "Code" and does not find one.
# To drop a column we must set axis=1, so that the name "Code" is looked up among the columns and dropped.

df1 = df.drop("Code", axis=1)
display(df1)

# drop() returns a new DataFrame; the original df is left unchanged.
# If we store the result in a variable such as df1, we can keep working with it.
# For instance, printing df again shows that the Code column still exists.
print(df)
# To modify df itself, pass inplace=True along with axis=1 (or reassign: df = df.drop("Code", axis=1))

df.drop("Code", axis=1, inplace=True) # inplace=True modifies df; axis=1 means the column axis

print(df) # now it's gone


# Remember: to drop a column from df itself,
#           pass inplace=True and axis=1 to drop(); otherwise df keeps the column.

# Let's say one year has passed and we want to increase every value in the Age column by 1

df["Age"] = df["Age"]+1 # applied to every row of the Age column
print(df)

# Let's say we want to drop a row from the table

without_row_2 = df.drop(2) # drops the row with index label 2 (the third row); df itself is unchanged

# To apply the change to df itself
df.drop(2,inplace=True)
print(df)
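As an alternative to `axis=1` and `inplace=True`, `drop` also accepts the keywords `columns=` and `index=`, and the result can simply be reassigned. The sketch below repeats the column and row removal in that style. The `reset_index(drop=True)` call is an optional extra step that renumbers the remaining rows.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Natarajan", "Nitro", "Thanaraj", "Athistalakshmi"],
     "Age": [32, 30, 60, 23],
     "Code": [12, 56, 45, 85]}
)

# columns= states the intent directly, so axis=1 is not needed
df = df.drop(columns=["Code"])

# index= does the same for row labels; reset_index renumbers the rows 0..n-1
df = df.drop(index=[2]).reset_index(drop=True)

print(df)
```

Reassigning keeps each step explicit and makes it possible to chain several operations in one expression.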


In [None]:
import pandas as pd
df = pd.read_csv("data_set/data.csv")

display(df.head()) # returns the first 5 rows
display(df.tail()) # returns the last 5 rows
display(df.dtypes) # returns the dtype of every column
display("Statistical Summary: ", df.describe()) # by default summarises the numerical columns only
df.info() # prints the non-null count and dtype of each column plus memory usage (it returns None, so display() is not needed)

# We can group by one column or by several columns
grouped = df.groupby("Category")["Value"].mean()

print("The mean Value per Category: \n", grouped)
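A single `groupby` can also compute several statistics at once through named aggregation. The sketch below inlines the first five rows of the dataset shown earlier so it runs without the CSV file. The output column names `mean_value`, `total_sales` and `rows` are chosen here for illustration.

```python
import pandas as pd

# First five rows of the dataset shown earlier (subset of columns)
df = pd.DataFrame(
    {"Category": ["A", "B", "C", "B", "B"],
     "Value": [28.0, 39.0, 32.0, 8.0, 26.0],
     "Sales": [754.0, 110.0, 398.0, 522.0, 869.0]}
)

# Named aggregation: new_column=(source_column, function)
summary = df.groupby("Category").agg(
    mean_value=("Value", "mean"),
    total_sales=("Sales", "sum"),
    rows=("Value", "count"),
)
print(summary)
```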
