Splitting a large dataset by the unique values in one of its columns is useful for creating per-group subsets for individual analysis, parallel processing, or categorization. Here is a step-by-step guide on how to do this using Python and pandas:

### Steps to Divide a Dataset Based on Unique Column Values
- Identify the column with unique values: Determine which column's unique values will be used to split the dataset.
- Extract unique values: Get a list of all unique values in that column.
- Create subsets: For each unique value, create a subset of the dataset where the column's value matches the unique value.

### Example
- Let's assume we have a dataset df and we want to split it based on the unique values in the column category.

In [5]:
import pandas as pd

# Sample DataFrame
data = {
    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'value': [10, 20, 10, 30, 20, 40, 30]
}
df = pd.DataFrame(data)

# Step 1: Identify the column with unique values
column_name = 'category'

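# Step 2: Extract unique values
unique_values = df[column_name].unique()

# Step 3: Create subsets (one filtered DataFrame per unique value)
subsets = {}
for value in unique_values:
    subsets[value] = df[df[column_name] == value]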

## Explanation
- Identify the column with unique values: In this example, the column category is used.
- Extract unique values: unique_values contains all unique values in the category column (['A', 'B', 'C']).
- Create subsets: A dictionary subsets is created where each key is a unique value from the category column, and the value is the subset of the DataFrame corresponding to that unique value.
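
The loop-and-filter approach scans the column once per unique value. As an alternative sketch (not the method used in this notebook), pandas' `groupby` can build the same kind of dictionary in a single pass. The sample data mirrors the example above; the name `subsets_gb` is chosen here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'value': [10, 20, 10, 30, 20, 40, 30],
})

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs,
# so a dict comprehension collects every subset in one pass.
# sort=False keeps the groups in order of first appearance.
subsets_gb = {key: group for key, group in df.groupby('category', sort=False)}

print(list(subsets_gb))
print(subsets_gb['A'])
```

Each group keeps the original row labels, so `subsets_gb['A']` holds the same rows (index 0, 2, 5) as the filtered version.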

## Accessing the Subsets
- After creating the subsets, you can access them using the unique values as keys:

In [7]:
# Access the subset for category 'A'
subset_A = subsets['A']
print(subset_A)

# Access the subset for category 'B'
subset_B = subsets['B']
print(subset_B)

# Access the subset for category 'C'
subset_C = subsets['C']
print(subset_C)

  category  value
0        A     10
2        A     10
5        A     40
  category  value
1        B     20
4        B     20
  category  value
3        C     30
6        C     30


## Practical Considerations
- **Memory Usage:** Be cautious with memory usage if the dataset is very large. Boolean indexing returns a copy of the matching rows, so holding every subset alongside the original DataFrame roughly doubles the memory used by the data.
- **Parallel Processing:** If you plan to process each subset separately, consider parallel processing (for example, Python's concurrent.futures or multiprocessing modules) to improve performance.
- **Data Integrity:** Ensure that the column used for splitting has been preprocessed and cleaned to avoid issues with inconsistent or missing values.

By following these steps, you can effectively divide a large dataset based on the unique values in a specific column, allowing for more focused and manageable analysis.
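
To illustrate the Data Integrity point: an equality filter such as `df[column_name] == value` never matches a missing value, so rows with NaN in the splitting column silently drop out of every subset, and stray whitespace or mixed case creates spurious categories. The sketch below uses hypothetical messy data and a made-up `'MISSING'` label to show one way of normalizing the column before splitting; the cleaning rules are assumptions to adapt to your own data.

```python
import pandas as pd

# Hypothetical messy data: stray spaces, mixed case, and a missing value.
raw = pd.DataFrame({
    'category': ['A', ' a', 'B', None, 'b ', 'C'],
    'value': [10, 20, 30, 40, 50, 60],
})

# Normalize the text, then give missing entries an explicit label
# ('MISSING' is an arbitrary placeholder chosen for this sketch).
clean = raw.copy()
clean['category'] = (
    clean['category']
    .str.strip()
    .str.upper()
    .fillna('MISSING')
)

subsets_clean = {
    value: clean[clean['category'] == value]
    for value in clean['category'].unique()
}

# Every row should land in exactly one subset.
assert sum(len(s) for s in subsets_clean.values()) == len(clean)
print({key: len(s) for key, s in subsets_clean.items()})
```

Without the `fillna` step, the row with the missing category would not appear in any subset.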

In [8]:
import pandas as pd

# Sample DataFrame
data = {
    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'value': [10, 20, 10, 30, 20, 40, 30]
}
df = pd.DataFrame(data)

# Step 1: Identify the column with unique values
column_name = 'category'

# Step 2: Extract unique values
unique_values = df[column_name].unique()

# Step 3: Create subsets
subsets = {}
for value in unique_values:
    subsets[value] = df[df[column_name] == value]

# Now `subsets` is a dictionary where each key is a unique value from the column,
# and the value is the corresponding subset of the DataFrame.

# Access and print the subsets for each unique category
for value in unique_values:
    print(f"Subset for category '{value}':")
    print(subsets[value])
    print()


Subset for category 'A':
  category  value
0        A     10
2        A     10
5        A     40

Subset for category 'B':
  category  value
1        B     20
4        B     20

Subset for category 'C':
  category  value
3        C     30
6        C     30



## In the code above:

- The DataFrame df is created correctly.
- The column_name is identified as category.
- Unique values are extracted using df[column_name].unique().
- Subsets are created and stored in the subsets dictionary.

In [9]:
# Optionally, print a specific subset if needed
# Access the subset for category 'A'
subset_A = subsets['A']
print("Subset for category 'A':")
print(subset_A)

# Access the subset for category 'B'
subset_B = subsets['B']
print("Subset for category 'B':")
print(subset_B)

# Access the subset for category 'C'
subset_C = subsets['C']
print("Subset for category 'C':")
print(subset_C)

Subset for category 'A':
  category  value
0        A     10
2        A     10
5        A     40
Subset for category 'B':
  category  value
1        B     20
4        B     20
Subset for category 'C':
  category  value
3        C     30
6        C     30


## Troubleshooting
If the code is not working as expected, consider checking the following:

- Ensure the DataFrame df is correctly created and populated.
- Verify the column name category is correct and exists in the DataFrame.
- Check for any errors or exceptions in the code execution, and ensure all necessary libraries (e.g., pandas) are imported and installed.
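
These checks can also be scripted. The helper below, `check_split_column`, is a hypothetical name introduced here, not a pandas function; it is a minimal sketch that raises a descriptive error before any splitting is attempted and reports how many rows have a missing value in the splitting column.

```python
import pandas as pd

def check_split_column(df, column_name):
    """Raise a descriptive error if df cannot be split on column_name."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"Expected a DataFrame, got {type(df).__name__}")
    if df.empty:
        raise ValueError("The DataFrame is empty")
    if column_name not in df.columns:
        raise KeyError(
            f"Column {column_name!r} not found; "
            f"available columns: {list(df.columns)}"
        )
    # Count missing values, which an equality filter would silently skip.
    return int(df[column_name].isna().sum())

df = pd.DataFrame({
    'category': ['A', 'B', 'A'],
    'value': [10, 20, 10],
})
missing_count = check_split_column(df, 'category')
print(f"Rows with a missing category: {missing_count}")
```

Calling the helper with a misspelled column name raises a `KeyError` that lists the columns actually present, which usually points straight at the typo.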

In [10]:
# Save the subset for category 'A' to a CSV file
subset_A.to_csv('subset_A.csv', index=False)

print("Subset for category 'A' has been saved to subset_A.csv")

Subset for category 'A' has been saved to subset_A.csv


## Explanation
- Create the DataFrame: The sample DataFrame df is created.
- Identify and extract unique values: The unique values in the column category are extracted.
- Create subsets: Subsets are created and stored in the subsets dictionary.
- Access and print subsets: Each subset is printed for verification.
- Save subset to CSV: The subset for category 'A' is saved to a CSV file named subset_A.csv using the to_csv method. The index=False parameter is used to exclude the DataFrame index from the CSV file.

## Result
- After running the code, you should have a CSV file named "subset_A.csv" in your working directory, containing the subset of the DataFrame where the category is 'A'.
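
One way to confirm the result is to read the file back and compare it with the subset in memory. This sketch writes to a temporary directory rather than the working directory (an assumption made so the example cleans up after itself).

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'value': [10, 20, 10, 30, 20, 40, 30],
})
subset_A = df[df['category'] == 'A']

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'subset_A.csv'
    subset_A.to_csv(csv_path, index=False)

    # index=False dropped the original row labels (0, 2, 5), so the
    # reloaded frame is numbered 0, 1, 2; reset before comparing.
    reloaded = pd.read_csv(csv_path)
    matches = reloaded.equals(subset_A.reset_index(drop=True))

print(f"Round trip preserved the data: {matches}")
```

Note that the original row labels are not recoverable from the file; if they matter, save with the index included instead of passing `index=False`.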

The .csv file will be stored in the current working directory of your script or notebook. The current working directory is the directory from which the script is run or where the Jupyter notebook is located.

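In [12]:
# Save the subset to an explicit Windows path.
# to_csv() returns None when given a path, so its result is not assigned.
subset_A.to_csv('C:\\Users\\user\\Desktop\\NEET DATA_ Analysis\\subset_A.csv', index=False)

- To determine the current working directory in your script or notebook, you can use the following code: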

In [11]:
import os

# Get the current working directory
current_directory = os.getcwd()
print(f"The CSV file will be saved in: {current_directory}")

The CSV file will be saved in: C:\Users\user\__ Machine_Learning\NEET UG 2024_Toy Data Set


This will print the path to the directory where the CSV file will be saved.

- If you want to save the CSV file to a specific directory, you can provide the full path to the to_csv method. For example:

In [None]:
# Save the subset for category 'A' to a specific directory
subset_A.to_csv('/path/to/your/directory/subset_A.csv', index=False)

Replace '/path/to/your/directory/' with the desired directory path where you want to save the file.
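
`to_csv` does not create missing folders, so writing to a directory that does not exist raises an error. A minimal sketch, assuming a hypothetical output folder named `subsets_out` inside a temporary directory, creates the folder first and then writes one file per category. It also assumes the category values are safe to use in file names.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'value': [10, 20, 10, 30, 20, 40, 30],
})
subsets = {
    value: df[df['category'] == value]
    for value in df['category'].unique()
}

# A temporary base directory keeps the sketch self-contained;
# replace it with your own path in real use.
base_dir = Path(tempfile.mkdtemp())
output_dir = base_dir / 'subsets_out'
output_dir.mkdir(parents=True, exist_ok=True)

written = []
for value, subset in subsets.items():
    file_path = output_dir / f'subset_{value}.csv'
    subset.to_csv(file_path, index=False)
    written.append(file_path.name)

print(sorted(written))
```

Building paths with `pathlib.Path` avoids hand-escaping backslashes in Windows paths.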

### Example Code with Directory Check
- Here is the updated code with the addition of checking and printing the current working directory:

In [None]:
import pandas as pd
import os

# Sample DataFrame
data = {
    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'value': [10, 20, 10, 30, 20, 40, 30]
}
df = pd.DataFrame(data)

# Step 1: Identify the column with unique values
column_name = 'category'

# Step 2: Extract unique values
unique_values = df[column_name].unique()

# Step 3: Create subsets
subsets = {}
for value in unique_values:
    subsets[value] = df[df[column_name] == value]

# Now `subsets` is a dictionary where each key is a unique value from the column,
# and the value is the corresponding subset of the DataFrame.

# Access and print the subsets for each unique category
for value in unique_values:
    print(f"Subset for category '{value}':")
    print(subsets[value])
    print()

# Access the subset for category 'A'
subset_A = subsets['A']
print("Subset for category 'A':")
print(subset_A)

# Get and print the current working directory
current_directory = os.getcwd()
print(f"The CSV file will be saved in: {current_directory}")

# Save the subset for category 'A' to a CSV file
subset_A.to_csv('subset_A.csv', index=False)

print("Subset for category 'A' has been saved to subset_A.csv")


Running this code prints the path to the current working directory, so you can confirm where the subset_A.csv file will be stored.

In [None]:
# df = pd.read_csv("C:\\Users\\user\\Desktop\\NEET DATA_ Analysis\\NEET_2024_RESULTS.csv")

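In [None]:
# Export DataFrames to Excel
# file_path = "Segregated_data.xlsx"
# df.to_excel("C:\\Users\\user\\Desktop\\Machine Learning\\Boston_Housing.xlsx", index=False)
# Syntax: df1.to_excel("C:\\Users\\user\\Desktop\\Machine Learning\\Boston_Housing.xlsx")
# print("Data has been exported to:", file_path)

# Note: df1, df2 and df3 are assumed to be DataFrames defined elsewhere;
# they are not created in this notebook.
# axis=1 places the frames side by side (column-wise) before export.
df_final = pd.concat([df, df1, df2, df3], axis=1)
# Writing .xlsx files requires an Excel engine such as openpyxl to be installed.
df_final.to_excel("C:\\Users\\user\\Desktop\\Machine Learning\\Exam_copy.xlsx", index=False)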