## Pandas Library: **Data cleaning** and **Data manipulation** for data analysis
- Pandas is a powerful Python library for data manipulation, widely used for data analysis and data cleaning.
- It provides two primary data structures: **Series** and **DataFrame**.
- A Series is a one-dimensional labelled array that can hold values of any data type.
- A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labelled axes (rows and columns).

Things to see:
  - Pandas - DataFrames and Series 
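
As a quick orientation, here is a minimal sketch of how the two structures relate. The names `ages` and `people` and the height values are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional, labelled values.
ages = pd.Series([32, 30, 60], name="Age")

# A DataFrame: two-dimensional, and each column may have its own dtype.
people = pd.DataFrame({
    "Name": ["Natarajan", "Nitro", "Thanaraj"],
    "Age": [32, 30, 60],
})

# Each column of a DataFrame is itself a Series.
age_column = people["Age"]
print(type(age_column))

# "Size-mutable" means columns can be added after creation.
people["Height"] = [170.5, 165.0, 172.3]  # illustrative values only

print(people.dtypes)  # one dtype per column (heterogeneous)
print(people.shape)   # (rows, columns)
```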

In [None]:
import pandas as pd
import numpy as np

# A Series is a 1-D array-like object that can hold any data type. It is similar to a column in a table.
data = [1, 2, 3, 4, 5, 6]  # just a plain Python list

# Create a Series from the list using pd.Series
series = pd.Series(data)
print("Series:\n", series)
# The Series looks like a single column of a table:
# 0    1
# 1    2
# 2    3
# 3    4
# 4    5
# 5    6
# dtype: int64

print(type(series))
# <class 'pandas.core.series.Series'>

# Create a Series from a dictionary
obj1 = {"name": "Nitro", "age":25 , "height": 123.5}
obj2 = {"a": 1, "b":2 , "c": 3}

series1 = pd.Series(obj1)
series2 = pd.Series(obj2)
print(series1)
print(series2)

# For a list, pandas generates the index automatically (0, 1, 2, ...).
# For a dictionary, the keys become the index.

# We can also supply our own index labels when building a Series from a list:
data = [10,20,30,40]
index=["a","b","c","d"]

series_list = pd.Series(data, index=index)
print(series_list)
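
Once a Series has an index, values can be looked up either by label or by position. The sketch below rebuilds the two kinds of Series from the cell above under new, illustrative names (`scores`, `from_dict`) so that it runs on its own.

```python
import pandas as pd

# Same shape of data as above: a list plus custom index labels.
scores = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = scores["b"]         # look up by index label
by_position = scores.iloc[1]   # look up by integer position
subset = scores[["a", "d"]]    # several labels at once returns a Series

print(by_label, by_position)
print(subset)

# When a Series is built from a dictionary, the keys become the index.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})
print(list(from_dict.index))
```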



In [None]:
# A DataFrame is a 2-D, size-mutable, potentially heterogeneous table
# with labelled axes (rows and columns).
# The main difference between a Series and a DataFrame is the number of columns:
# a Series has exactly one column, whereas a DataFrame can have as many labelled columns as we need.

# Create a DataFrame from a dictionary of lists

data = {
  "Name": ["Natarajan","Nitro","Thanaraj","Athistalakshmi"],
  "Age":[32,30,60,23],
  "City": ["Trichy","Ariyalur","Perambalur","Theni"]
}

df = pd.DataFrame(data)

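print(type(df))  # <class 'pandas.core.frame.DataFrame'>

# A DataFrame can be converted to a NumPy array (df.to_numpy() is the equivalent pandas method).
# Because the columns have mixed types, the resulting array has dtype object.
num_arr = np.array(df)
print(num_arr[..., 1])  # second column, i.e. the "Age" values

# Create a DataFrame from a list of dictionaries as well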

data1 = [{"Name":"Natarajan", "Age": 32 , "City": "Trichy"},
         {"Name":"Nitro", "Age": 30 , "City": "Thanjavur"},
         {"Name":"Thanaraj", "Age": 28 , "City": "Ariyalur"},
         {"Name":"Athista", "Age": 20 , "City": "Ariyalur"}
         ]
df1 = pd.DataFrame(data1)
print("df1: \n", df1)

# Select a single column by indexing the DataFrame with the column name.
# The result is a Series containing every row of that column.
name = df1["Name"]
age = df1["Age"]
display(name)  # display() is available in Jupyter/IPython; use print() in a plain script
display(age)

# In practice we usually work with larger datasets stored in files such as CSV or XLSX.
# pandas provides pd.read_csv for CSV files (and pd.read_excel for Excel files).

# Read a CSV file with pd.read_csv("path")
df = pd.read_csv("./data_set/data.csv")

# to get the top 5 records
df.head(5)

# to get the last 5 records
df.tail(5)
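
The cell above depends on a file at `./data_set/data.csv`. To try `read_csv`, `head`, and `tail` without that file, the sketch below reads from an in-memory string instead. The column names follow the ones used later in this notebook, but every row in `csv_text` is made-up sample data, not the real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real data file.
csv_text = """Product,Date,Category,Sales,Value
Pen,2024-01-01,Stationery,150,300
Book,2024-01-02,Stationery,320,640
Bag,2024-01-03,Accessories,560,900
Lamp,2024-01-04,Home,820,1200
Desk,2024-01-05,Home,910,1500
Chair,2024-01-06,Home,430,860
"""

# read_csv accepts a path or any file-like object, such as StringIO.
df_demo = pd.read_csv(io.StringIO(csv_text))

first_rows = df_demo.head(3)  # first 3 rows
last_rows = df_demo.tail(2)   # last 2 rows

print(first_rows)
print(last_rows)
print(df_demo.shape)  # (number of rows, number of columns)
```

In a notebook, only the last bare expression of a cell is displayed, so wrapping intermediate results in `print()` or `display()` makes each one visible.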


Where to Use loc[] and iloc[] in Real-World Data Science Tasks?

Answer: 
  - Data Cleaning & Filtering: Extracting specific rows based on conditions.
  - Feature Selection: Selecting specific columns for machine learning models.
  - Data Preprocessing: Modifying data, filling missing values in certain rows.
  - Exploratory Data Analysis (EDA): Selecting subsets of data for visualization.
  - Data Transformation: Creating new calculated fields based on certain conditions.
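
The tasks listed above can be sketched on a small table. The DataFrame `df_tasks`, its values, and the pass mark of 80 are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values.
df_tasks = pd.DataFrame(
    {
        "Age": [25, 30, np.nan, 41],
        "Score": [85.0, 90.0, 70.0, np.nan],
        "City": ["Trichy", "Theni", "Trichy", "Ariyalur"],
    },
    index=["a", "b", "c", "d"],
)

# Data cleaning and filtering: rows that satisfy a condition.
trichy_rows = df_tasks.loc[df_tasks["City"] == "Trichy"]

# Feature selection: pick columns by name (loc) or by position (iloc).
features = df_tasks.loc[:, ["Age", "Score"]]
first_two = df_tasks.iloc[:, :2]

# Exploratory analysis: take a positional subset of rows to inspect or plot.
sample = df_tasks.iloc[:3]

# Data preprocessing: fill a missing value in one specific cell.
df_tasks.loc["c", "Age"] = df_tasks["Age"].median()

# Data transformation: a new field computed from a condition.
df_tasks["Passed"] = False
df_tasks.loc[df_tasks["Score"] >= 80, "Passed"] = True

print(df_tasks)
```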

In [None]:
# The .loc[] and .iloc[] indexers in pandas select data from a DataFrame
# by label or by integer position, respectively.

# Here’s where and how you can use them:

# 1. Selecting Rows and Columns:
# .loc[] selects data by label (index and column names).
# .iloc[] selects data by integer position.
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Score': [85, 90, 95]}

df = pd.DataFrame(data, index=['a', 'b', 'c'])

# Using loc to select by index label
print(df.loc['a'])  # Select row with index label 'a'

# Using iloc to select by integer position
print(df.iloc[0])  # Select first row (position 0)

# 2. Filtering Data Based on Conditions:
# .loc[] is commonly used for filtering data based on conditions.

# Select all rows where Age > 25
df_filtered = df.loc[df['Age'] > 25]
print(df_filtered)


# 3. Selecting Specific Columns:
# .loc[] can be used to select specific columns by name.
# .iloc[] can be used to select specific columns by position.

# Select 'Name' and 'Score' columns using loc
print(df.loc[:, ['Name', 'Score']])

# Select first two columns using iloc
print(df.iloc[:, [0, 1]])


# 4. Modifying Data:
# .loc[] is useful for updating values in specific rows and columns.

# Update Score for index 'b'
df.loc['b', 'Score'] = 95
print(df)


# 5. Slicing Rows and Columns:
# .iloc[] can be used for slicing by row and column positions.

# Select first two rows and first two columns
print(df.iloc[0:2, 0:2])


# 6. Selecting a Single Value (Scalar Selection):
# You can use .loc[] and .iloc[] to get a specific value.

# Get the value at row 'c' and column 'Score'
print(df.loc['c', 'Score'])

# Get the value at first row, second column
print(df.iloc[0, 1])
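
One detail is easy to miss when slicing: a `.loc[]` slice includes its end label, while an `.iloc[]` slice excludes its end position, like ordinary Python slicing. The sketch below rebuilds the same small table under the name `df_slice` so that it runs on its own.

```python
import pandas as pd

df_slice = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"],
     "Age": [25, 30, 35],
     "Score": [85, 90, 95]},
    index=["a", "b", "c"],
)

# Label slices are inclusive: rows 'a' through 'b', columns 'Name' through 'Age'.
by_label = df_slice.loc["a":"b", "Name":"Age"]

# Position slices are end-exclusive: rows 0 and 1, columns 0 and 1.
by_position = df_slice.iloc[0:2, 0:2]

print(by_label)
print(by_position)
print(by_label.equals(by_position))  # both select the same 2x2 block

# For a single scalar, .at[] (label) and .iat[] (position) are alternatives.
score_c = df_slice.at["c", "Score"]
age_first = df_slice.iat[0, 1]
print(score_c, age_first)
```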

In [None]:
import pandas as pd
df = pd.read_csv("data_set/data.csv")
# Select a row by its index label, and a single value by row label and column name
df.loc[0]           # row whose index label is 0 (the first row with the default index)
df.loc[0, "Sales"]  # "Sales" value of that row

# Select a whole column by name
df.loc[:, "Sales"]

# Select multiple columns
df[["Product","Date", "Category" , "Sales" ,"Value"]]

# Filtering & Conditional Selection
df.loc[df["Sales"] < 200] 
df.query("Sales > 700")

# Methods for extracting specific rows based on conditions.

df[df['Sales'] > 800] # Filter rows
df.query("Sales > 300") # Query-based filtering
df[(df['Sales'] > 500) & (df['Value'] <1000)] # Multiple conditions
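df[df["Sales"].isin([300, 500, 800])]  # Filter rows matching a list of values (example values)
# df[df['Date'].str.contains('NaN')] # Filter string-based values
# Note: column labels are case-sensitive; df["value"] raises KeyError: 'value' because the column is named "Value".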
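
The filtering patterns in the cell above can be tried without the CSV file. The sketch below uses a small made-up table named `sales`; its products and numbers are illustrative only, and the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical data; one product name is missing on purpose.
sales = pd.DataFrame({
    "Product": ["Pen", "Book", "Bag", "Lamp", None],
    "Sales": [150, 320, 560, 820, 910],
    "Value": [300, 640, 900, 1200, 1500],
})

# Boolean mask and query() express the same filter.
high = sales[sales["Sales"] > 800]
via_query = sales.query("Sales > 800")

# Combine conditions with & (and) or | (or); each condition needs parentheses.
both = sales[(sales["Sales"] > 500) & (sales["Value"] < 1000)]

# isin() keeps rows whose value appears in the given list.
listed = sales[sales["Sales"].isin([150, 910])]

# str.contains() on a column with missing values: na=False treats them as no match.
with_b = sales[sales["Product"].str.contains("B", na=False)]

print(high)
print(both)
print(listed)
print(with_b)
```

Column labels are case-sensitive, so `sales["value"]` would raise a `KeyError` here while `sales["Value"]` works.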