**Data manipulation is a fundamental part of the machine learning workflow, because the quality and format of the data strongly affect model performance. This section walks through data manipulation in Python with pandas: loading data, indexing and selection, reshaping and combining, and cleaning.**

Pandas is a powerful Python library for data manipulation and analysis. One of its key data structures is the DataFrame, which is akin to a table or spreadsheet with rows and columns.


Pandas provides functions like read_csv(), read_json(), read_excel(), and read_sql() to read data from different sources and create DataFrame objects.
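
As a minimal sketch of the readers listed above, the following parses CSV text with read_csv(). The in-memory string, its column names, and the variable names are made up for illustration; in practice a file path would usually be passed instead.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real workflow would pass a file path instead.
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv() accepts any file-like object, so StringIO stands in for a file.
scores = pd.read_csv(io.StringIO(csv_text))
print(scores.shape)
print(scores.dtypes)
```

The other readers follow the same pattern: each takes a source and returns a DataFrame.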

# Indexing and Selection:

In [19]:
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
print(df)



   A  B
x  1  4
y  2  5
z  3  6


In [5]:
# Label-based indexing using loc[]
print(df.loc['x', 'A'])



1


In [6]:
# Position-based indexing using iloc[]
print(df.iloc[0, 1])



4


In [7]:
# Boolean indexing
print(df[df['A'] > 1])

   A  B
y  2  5
z  3  6
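
The three indexing styles above also accept slices and combined conditions. The sketch below rebuilds the same sample frame so it can run on its own; the names by_label, by_position, and both are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

# loc[] slices by label and includes both endpoints.
by_label = df.loc['x':'y', ['A']]

# iloc[] slices by position and excludes the stop position.
by_position = df.iloc[0:2, 0:1]

# Combine boolean conditions with & or |, wrapping each one in parentheses.
both = df[(df['A'] > 1) & (df['B'] < 6)]
print(both)
```

Both slices select rows x and y of column A; the difference is only in how the endpoints are interpreted.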


# Data Manipulation:

In [20]:
# Adding and Removing Columns
df['C'] = [7, 8, 9]
df.drop(columns=['B'], inplace=True)
print(df)

# Sorting
df.sort_values(by='A', inplace=True)

# Reshaping
stacked = df.stack()

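# Merging and Joining
# Note: merge() defaults to an inner join, and df and df2 share no values
# in 'A', so merged_df has no rows
df2 = pd.DataFrame({'A': [4, 5, 6], 'D': [10, 11, 12]}, index=['x', 'y', 'z'])
merged_df = df.merge(df2, on='A')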

# Grouping
grouped = df.groupby('A').sum()

# Applying Functions
df['A_squared'] = df['A'].apply(lambda x: x**2)


   A  C
x  1  7
y  2  8
z  3  9
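
Merging on a column keeps only the rows whose key values appear in both frames. As a sketch, using two small frames shaped like df and df2 above (the names left, right, on_column, and on_index are illustrative), the following contrasts a column merge that finds no matches with a merge aligned on the index.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3], 'C': [7, 8, 9]}, index=['x', 'y', 'z'])
right = pd.DataFrame({'A': [4, 5, 6], 'D': [10, 11, 12]}, index=['x', 'y', 'z'])

# Inner merge on 'A': no value appears in both frames, so no rows match.
on_column = left.merge(right, on='A')
print(len(on_column))

# Merge on the index instead; suffixes disambiguate the two 'A' columns.
on_index = left.merge(right, left_index=True, right_index=True,
                      suffixes=('_left', '_right'))
print(on_index)
```

Which key to merge on depends on what identifies a row in the data; here the index labels are the shared identifier.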


# Data Cleaning:

In [None]:
# Missing Data Handling
df.loc['y', 'A'] = pd.NA
print(df.isna())
df.fillna(0, inplace=True)

# Duplicated Data
df.loc['y'] = df.loc['x']  # Create a duplicate row
print(df.duplicated())
df.drop_duplicates(inplace=True)

# Data Transformation
df['A'] = df['A'].astype(float)

# String Operations
# Only rows 'x' and 'z' remain after drop_duplicates(), so supply two labels
df['label'] = ['foo', 'bar']
print(df['label'].str.contains('b'))
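
Filling with 0 is only one way to handle missing values. The sketch below, on a small made-up frame (data, dropped, and filled are illustrative names), shows two common alternatives: dropping incomplete rows and filling with the column mean. Which one is appropriate depends on the dataset and the model.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'C': [7, 8, 9]},
                    index=['x', 'y', 'z'])

# Option 1: drop every row that contains a missing value.
dropped = data.dropna()

# Option 2: fill the gap with the column mean (one common choice, not the only one).
filled = data.fillna({'A': data['A'].mean()})
print(filled)
```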


