In [None]:
# Import necessary libraries
import pandas as pd

# Load dataset (already loaded previously, shown here for completeness)
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Gender pay gap.csv')

# --- MEAN ---
# Calculate the mean base salary
mean_basePay = df['basePay'].mean()
print("Mean Base Pay:", mean_basePay)

# --- MEDIAN ---
# Calculate the median age
median_age = df['age'].median()
print("Median Age:", median_age)

# --- MODE ---
# Calculate the mode of job titles
mode_jobTitle = df['jobTitle'].mode()[0]
print("Mode Job Title:", mode_jobTitle)

# --- OPTIONAL: Grouping Example (Extra) ---
# Mean basePay by gender
grouped_mean_salary = df.groupby('gender')['basePay'].mean()
print("\nMean Base Pay by Gender:\n", grouped_mean_salary)

Mean Base Pay: 94472.653
Median Age: 41.0
Mode Job Title: Marketing Associate

Mean Base Pay by Gender:
 gender
Female    89942.818376
Male      98457.545113
Name: basePay, dtype: float64


In [None]:
The code analyzes a dataset on the gender pay gap using descriptive statistics. The attributes `basePay`, `age`, and `jobTitle` were selected because they are standard starting points in exploratory data analysis and are directly relevant to understanding pay disparity.

* **`basePay`**:  This is the central attribute for investigating the gender pay gap. Calculating its *mean* provides a general overview of average salary, crucial for initial comparisons across demographics.  Further analysis (shown in the optional grouping example) uses `basePay` alongside `gender` to directly compare mean salaries between genders, revealing potential pay discrepancies.

* **`age`**:  Age often correlates with salary and experience.  The *median* age is used instead of the mean as it's less sensitive to outliers, providing a more robust representation of the typical employee age within the dataset.  Understanding the age distribution is essential, as differences in average age between gender groups could influence pay gap observations.

* **`jobTitle`**:  Job title strongly influences salary.  The *mode* provides the most frequent job title in the dataset, highlighting the dominant role within the organization.  While the mode itself may not directly reveal pay disparities, further investigation using job title as a grouping factor could reveal significant variations in pay across different roles and potential gender representation within those roles.

The optional grouping by `gender` further enhances the initial analysis by directly comparing the mean base pay between different genders. This gives more focused insights into salary differences based on gender, building upon the general statistics calculated earlier. The selection of descriptive statistics (mean, median, mode) is appropriate for summarizing the respective attributes and identifying potential trends for further investigation.
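The follow-up suggested above (grouping by job title as well as gender) could look roughly like the sketch below. It uses a small invented frame rather than the real file; only the column names `gender`, `jobTitle`, and `basePay` come from the dataset, and every value is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: values are invented for illustration only.
toy = pd.DataFrame({
    "gender":   ["Female", "Male", "Female", "Male", "Female", "Male"],
    "jobTitle": ["Manager", "Manager", "Driver", "Driver", "Driver", "Manager"],
    "basePay":  [100000, 110000, 60000, 62000, 58000, 120000],
})

# Mean, median, and group size of basePay for each jobTitle/gender pair.
# The count matters: a mean computed from one or two rows is not reliable.
summary = (
    toy.groupby(["jobTitle", "gender"])["basePay"]
       .agg(["mean", "median", "count"])
)
print(summary)
```

Reporting the group size next to each mean makes it easier to spot comparisons that rest on very few employees.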


In [None]:
# Import necessary libraries
import pandas as pd

# Load dataset (HR Dataset)
hr_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/HR_Dataset.csv')

# --- MEAN ---
# Calculate the mean salary
mean_salary = hr_df['Salary'].mean()
print("Mean Salary:", mean_salary)

# --- MEDIAN ---
# Calculate the median number of absences
median_absences = hr_df['Absences'].median()
print("Median Absences:", median_absences)

# --- MODE ---
# Calculate the mode of employee positions
mode_position = hr_df['Position'].mode()[0]
print("Mode Position:", mode_position)

# --- OPTIONAL: Grouping Example (Extra) ---
# Mean salary by Department
grouped_mean_salary_by_dept = hr_df.groupby('Department')['Salary'].mean()
print("\nMean Salary by Department:\n", grouped_mean_salary_by_dept)


Mean Salary: 69020.6848874598
Median Absences: 10.0
Mode Position: Production Technician I

Mean Salary by Department:
 Department
Admin Offices            71791.888889
Executive Office        250000.000000
IT/IS                    97064.640000
Production               59953.545455
Sales                    69061.258065
Software Engineering     94989.454545
Name: Salary, dtype: float64


In [None]:
# prompt: For the code above, discuss the selected attribute(s), their purpose, and the rationale behind them
# within 200-250 words

The provided code analyzes two datasets: a gender pay gap dataset and an HR dataset, using descriptive statistics.  The choice of attributes and descriptive statistics is again driven by the goal of exploring potential trends and disparities within the data.  Let's examine the attributes and statistics used in each:

**Gender Pay Gap Dataset:**

*   `basePay`:  This remains the key attribute, and its mean is crucial for establishing a baseline average salary. The subsequent grouping by `gender` directly addresses the central research question of pay disparity.

*   `age`:  Median age provides a robust representation of the typical employee age because it is not pulled by unusually young or old employees. The rationale is that age and experience usually correlate with salary, so the age distribution's effect on the pay gap needs to be considered.

*   `jobTitle`: The mode of `jobTitle` identifies the most common job role, providing context for the overall salary analysis. Further investigation would involve analyzing salary distributions by `jobTitle`, considering gender representation within roles to assess potential disparities.

**HR Dataset:**

*   `Salary`:  The mean salary is a standard measure used to establish a general overview of compensation levels within the organization.

*   `Absences`:  The median number of absences provides a robust measure of employee absenteeism, less susceptible to outliers compared to the mean. This is relevant for HR analytics, potentially indicating employee well-being or performance issues.

*   `Position`: The mode of `Position` indicates the most prevalent job role in the organization, serving as a point of reference for understanding the overall workforce composition.  Further analyses could investigate how salary or absenteeism varies across different positions.


In both cases, the selection of descriptive statistics (mean, median, mode) caters to identifying central tendencies and distributions for respective attributes. The additional grouping by `gender` (in the gender pay gap dataset) and by `Department` (in the HR dataset) allows for more granular analysis to understand how the key variables vary across groups, uncovering potential patterns and disparities.
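The difference between mean and median under an extreme value can be seen in a small sketch. The salaries below are hypothetical; the last one only mimics a single very high earner, similar in spirit to the 250000 mean shown for the Executive Office in the output above.

```python
import pandas as pd

# Hypothetical salaries; the last value mimics a single very high earner.
salaries = pd.Series([50000, 55000, 60000, 65000, 250000])

mean_salary = salaries.mean()      # pulled upward by the extreme value
median_salary = salaries.median()  # middle value; does not depend on how large the maximum is

print(f"Mean:   {mean_salary}")
print(f"Median: {median_salary}")
```

Four of the five values sit below the mean here, which is why the median is often the safer summary for columns such as `Absences` or `Salary` when a few extreme rows are present.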


In [None]:
# prompt: for the above data set  /content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Asian_American_Quality_of_Life.csv
#  Selecting attribute(s) to perform the following “Measure of central tendency”:
# • Mean, Median and Mode.

import pandas as pd

# Load the Asian American Quality of Life dataset
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Asian_American_Quality_of_Life.csv')

# Select relevant numerical columns for central tendency measures
numerical_cols = df.select_dtypes(include=['number']).columns

# --- MEAN ---
# Calculate the mean for each numerical column
mean_values = df[numerical_cols].mean()
print("Mean:\n", mean_values)

# --- MEDIAN ---
# Calculate the median for each numerical column
median_values = df[numerical_cols].median()
print("\nMedian:\n", median_values)

# --- MODE ---
# Calculate the mode for each numerical column
mode_values = df[numerical_cols].mode().iloc[0] # Take the first mode if multiple exist
print("\nMode:\n", mode_values)


Mean:
 Survey ID                  2.105820e+06
Age                        4.285192e+01
Education Completed        1.507815e+01
Household Size             3.295139e+00
Grandparent                0.000000e+00
Other Relative             0.000000e+00
Other                      0.000000e+00
Self Employed Full Time    0.000000e+00
Self Employed Part Time    0.000000e+00
Disabled                   0.000000e+00
Unemployed                 0.000000e+00
Other Employement          0.000000e+00
Achieving Ends Meet        1.722449e-01
Duration of Residency      1.563509e+01
Primary Language           6.675697e-01
Discrimination             3.027923e-01
Hygiene Assistance         2.705837e-02
Smoking                    6.138996e-02
Drinking                   3.320463e-02
Regular Exercise           6.178737e-01
Healthy Diet               8.086420e-01
Heart Disease              0.000000e+00
Stroke                     0.000000e+00
Cancer                     0.000000e+00
Hepatitis                  0.0000

In [None]:
# prompt: For the code above, discuss the selected attribute(s), their purpose, and the rationale behind them
# within 200-250 words

# The code analyzes the Asian American Quality of Life dataset using descriptive statistics.
# The selection of all numerical columns is a broad approach aiming to provide an overview of central tendencies across various aspects of quality of life.

# --- MEAN ---
# Calculating the mean for each numerical column provides an average value for each factor. This helps understand the typical level of each quality of life indicator.
# For example, the mean income provides an overview of the average income in the dataset.

# --- MEDIAN ---
# The median is used because it's a robust measure less affected by outliers.  Certain quality of life indicators might have skewed distributions (e.g., income or wealth).
# The median offers a better representation of the 'typical' value in the presence of extreme values, providing a more balanced view compared to the mean.

# --- MODE ---
# The mode reveals the most frequent value for each numerical column. This can be particularly insightful for categorical variables that have been numerically coded.
# In the context of a quality of life dataset, the mode might indicate the most common level of education or a prevalent housing situation.
# For continuous numerical variables, the mode may pinpoint concentration points within the distribution.  However, for truly continuous data (e.g. income or age), the mode may not be meaningful and it may be more useful to consider other measures of central tendency.  When several values tie for the mode, the code takes the first value returned by `mode()`; pandas returns tied modes in sorted order, so this is the smallest of them rather than the first to occur in the data.

# Overall, the comprehensive analysis using mean, median, and mode for all numerical columns facilitates a thorough initial exploration of the dataset.
# This exploration provides a foundational understanding of the distribution and central tendencies of various quality of life aspects within the Asian American community.  Further investigation will involve other statistical approaches and potentially visualization techniques.


In [None]:
# prompt: /content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Provisional_COVID-19_Deaths.csv
# Selecting attribute(s) to perform the following “Measure of central tendency”:
# • Mean, Median and Mode.

import pandas as pd

# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Provisional_COVID-19_Deaths.csv')

# Select the columns you want to analyze (replace with your actual column names)
columns_to_analyze = ['COVID-19 Deaths', 'Total Deaths'] # Example columns, replace with your choices

# Calculate Mean, Median, and Mode for each selected column
for column in columns_to_analyze:
    if pd.api.types.is_numeric_dtype(df[column]):
        mean_val = df[column].mean()
        median_val = df[column].median()
        modes = df[column].mode()  # all tied modes in sorted order; empty only if the column has no non-null values
        mode_val = modes.iloc[0] if not modes.empty else "No mode (column has no data)"

        print(f"Results for column: {column}")
        print(f"Mean: {mean_val}")
        print(f"Median: {median_val}")
        print(f"Mode: {mode_val}")
        print("-" * 20)
    else:
        print(f"Column '{column}' is not numeric, cannot calculate mean, median, and mode.")


Results for column: COVID-19 Deaths
Mean: 367.0516483376448
Median: 10.0
Mode: 0.0
--------------------
Results for column: Total Deaths
Mean: 2830.7697103562336
Median: 150.0
Mode: 0.0
--------------------


In [None]:
# prompt: For the code above, discuss the selected attribute(s), their purpose, and the rationale behind them
# within 200-250 words

The code analyzes multiple datasets using descriptive statistics (mean, median, mode). The choice of attributes and statistics depends on the dataset and the goals of the analysis.

**General Rationale:**

*   **Mean:** Provides the average value, useful for understanding the typical value of a numerical attribute. However, it's sensitive to outliers.

*   **Median:** Represents the middle value when data is ordered, less sensitive to outliers than the mean. It provides a more robust measure of central tendency, especially when dealing with skewed distributions.

*   **Mode:** Indicates the most frequent value. It's particularly useful for categorical data or identifying peaks in the distribution of numerical data.  For continuous data, the mode might not be as meaningful.

**Dataset-Specific Rationale:**

1.  **Gender Pay Gap Dataset:**  `basePay`, `age`, and `jobTitle` are selected to explore potential gender-based salary differences. `basePay` (mean) is the central measure. `age` (median) accounts for potential outliers. `jobTitle` (mode) provides context regarding the most prevalent job role. Grouping by gender allows direct comparison.

2.  **HR Dataset:** `Salary` (mean), `Absences` (median), and `Position` (mode) are chosen to understand employee compensation, absenteeism, and job role distribution. Median `Absences` mitigates outlier effects.  Grouping by department provides additional insight.

3.  **Asian American Quality of Life Dataset:** All numerical columns are analyzed to give an overview of several quality of life aspects.  Both mean and median are used because some indicators may have skewed distributions. The mode provides information about frequent values, especially for potentially categorical or discrete variables.

4.  **Provisional COVID-19 Deaths Dataset:**  The code calculates mean, median, and mode for specified columns (`COVID-19 Deaths`, `Total Deaths`).  The choice of columns depends on the specific research questions. The inclusion of a check for numeric data type ensures the code handles non-numeric columns gracefully, preventing errors.


In summary, the choice of descriptive statistics is determined by the nature of the data and the research questions.  The code uses mean to represent typical values, median for robustness against outliers, and mode to identify frequent values, giving a comprehensive initial exploration of the datasets. The additional grouping in some cases allows deeper analysis of the relationships between variables.
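How `mode()` behaves when several values tie is easy to overlook, so here is a minimal sketch with invented values. It relies only on `pandas.Series.mode`, which returns every tied mode in sorted order; the series contents are hypothetical.

```python
import pandas as pd

# Hypothetical numeric data in which 1 and 3 both occur twice.
s = pd.Series([3, 1, 3, 1, 2])
modes = s.mode()             # all tied modes, returned in sorted order
first_mode = modes.iloc[0]   # what `.mode()[0]` / `.mode().iloc[0]` would report
print("All modes:", modes.tolist())
print("First mode:", first_mode)

# The same call works on categorical (string) data such as job titles.
titles = pd.Series(["Driver", "Manager", "Driver"])
print("Most common title:", titles.mode()[0])
```

Printing the full result of `mode()` before taking element 0 shows whether a reported mode is unique or just one of several equally frequent values.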


In [None]:
# prompt: Selecting attribute(s) to perform the following “Measure of Spread/Dispersion”: Range,
# Variance and Standard Deviation.
# perform with the above all data sets with their paths

import pandas as pd

# Load datasets (replace with your actual file paths if different)
file_paths = [
    '/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Gender pay gap.csv',
    '/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/HR_Dataset.csv',
    '/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Asian_American_Quality_of_Life.csv',
    '/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Provisional_COVID-19_Deaths.csv'
]

for file_path in file_paths:
    try:
        df = pd.read_csv(file_path)
        print(f"Analyzing: {file_path}")

        # Select numerical columns (excluding object types which are likely categorical)
        numerical_cols = df.select_dtypes(include=['number']).columns

        for col in numerical_cols:
            if len(df[col].dropna()) > 0:  # Skip columns that contain no non-null values
                # Calculate Range, Variance, and Standard Deviation
                data_range = df[col].max() - df[col].min()
                variance = df[col].var()  # sample variance (pandas default ddof=1)
                std_dev = df[col].std()   # sample standard deviation (ddof=1)

                print(f"Column: {col}")
                print(f"  Range: {data_range}")
                print(f"  Variance: {variance}")
                print(f"  Standard Deviation: {std_dev}")
                print("-" * 20)
        print("\n") # Add extra line break for better readability between files
    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
    except Exception as e:
        print(f"An error occurred while processing {file_path}: {e}")


Analyzing: /content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Gender pay gap.csv
Column: age
  Range: 47
  Variance: 204.3428938938925
  Standard Deviation: 14.294855504477564
--------------------
Column: perfEval
  Range: 4
  Variance: 2.027658658658644
  Standard Deviation: 1.423958798090255
--------------------
Column: seniority
  Range: 4
  Variance: 1.9461051051051323
  Standard Deviation: 1.3950287112117559
--------------------
Column: basePay
  Range: 145518
  Variance: 641988565.313904
  Standard Deviation: 25337.493272103773
--------------------
Column: bonus
  Range: 9590
  Variance: 4017528.6196986963
  Standard Deviation: 2004.377364594476
--------------------


Analyzing: /content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/HR_Dataset.csv
Column: EmpID
  Range: 310
  Variance: 8086.0
  Standard Deviation: 89.92218858546538
--------------------
Column: MarriedID
  Range: 1
  Variance: 0.24051446945337662
  Standard Deviation: 0.49042274565254

In [None]:
# prompt: For the code above, discuss the selected attribute(s), their purpose, and the rationale behind them, within 200-250 words for each dataset in the above result

The provided code calculates the range, variance, and standard deviation for numerical columns in multiple datasets.  Let's analyze the rationale behind these choices:

**Range:**

The *range* provides a simple measure of the spread of the data, representing the difference between the maximum and minimum values.  It's easy to understand and calculate but highly sensitive to outliers.  A large range indicates a wide dispersion of values, while a small range suggests that values are clustered closely together. In the context of these datasets, the range helps visualize the extent of variation within each numerical attribute. For instance, a large range in income would indicate a significant disparity in income levels within the dataset.

**Variance:**

The *variance* measures the average squared deviation of the data points from the mean; pandas' `var()` computes the sample variance, dividing by n - 1 rather than n. It quantifies the spread of data points around the mean, so a higher variance means the values lie farther from the mean on average. Unlike the range, variance uses every observation rather than only the two extremes, but because deviations are squared it is still strongly affected by outliers. It is also expressed in squared units (for example, dollars squared for `basePay`), which makes it hard to interpret directly. For datasets like the Asian American Quality of Life or the Gender Pay Gap dataset, variance shows how dispersed the quality of life indicators or salaries are, respectively. A higher variance in salaries, for instance, would indicate that pay differs more widely across employees.


**Standard Deviation:**

The *standard deviation* is the square root of the variance.  It expresses the data's spread in the same units as the original data, making it more interpretable than variance. Like variance, it reflects how much the data points deviate from the mean.  A larger standard deviation means the data is more spread out. For example, a large standard deviation in COVID-19 deaths could indicate significant fluctuations in mortality across different regions or time periods. Similarly, in the HR dataset, a higher standard deviation in salary might signify a greater disparity in compensation among employees.

**Rationale for Selecting These Measures:**

These three measures (range, variance, and standard deviation) describe the dispersion of the data around its center. The *range* is the simplest but depends only on the two most extreme values. *Variance* and *standard deviation* use every observation and so give a fuller picture of how values are distributed, although both remain sensitive to outliers. *Standard deviation*, expressed in the same units as the original data, is the easiest of the three to interpret.

Together, range, variance, and standard deviation provide valuable information about the variability within each numerical attribute across datasets, allowing a more detailed analysis than central tendency measures alone.  They are essential for understanding the distribution and spread of the data. The code's check for numerical columns ensures that these calculations are applied appropriately and that errors due to non-numeric data are avoided.
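A short sketch can make the relationships between the three measures concrete. The numbers are hypothetical; the only library behavior assumed is that pandas `var()` and `std()` default to `ddof=1` (the sample versions).

```python
import pandas as pd

# Hypothetical data, with and without one extreme value.
values = pd.Series([10, 12, 11, 13, 12])
with_outlier = pd.Series([10, 12, 11, 13, 100])

def spread(s):
    """Return range, sample variance, and sample standard deviation."""
    return {
        "range": s.max() - s.min(),
        "variance": s.var(),   # divides by n - 1 (ddof=1)
        "std": s.std(),        # square root of the sample variance
    }

base = spread(values)
shifted = spread(with_outlier)
print("Without outlier:", base)
print("With outlier:   ", shifted)

# Population variance (divide by n) is available through ddof=0.
print("Population variance:", values.var(ddof=0))
```

All three measures grow sharply once the extreme value is added, which illustrates that none of them is robust to outliers.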


In [None]:
# prompt: Selecting at least TWO attributes appropriately to construct confidence intervals with the above data sets and with their paths

import pandas as pd
import numpy as np
from scipy import stats

def calculate_confidence_interval(data, confidence=0.95):
    """Calculates the confidence interval for a given dataset."""
    if len(data) < 2:  # need at least two data points
        return None
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    h = se * stats.t.ppf((1 + confidence) / 2., n-1)
    return m-h, m+h

# Load the datasets (replace with your actual file paths)
file_paths = [
    '/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Gender pay gap.csv',
    '/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/HR_Dataset.csv',
    '/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Asian_American_Quality_of_Life.csv',
    '/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Provisional_COVID-19_Deaths.csv'
]

for file_path in file_paths:
    try:
        df = pd.read_csv(file_path)
        print(f"Analyzing: {file_path}")

        # Select numerical columns (excluding object types which are likely categorical)
        numerical_cols = df.select_dtypes(include=['number']).columns

        for col in numerical_cols:
            if len(df[col].dropna()) >= 2 : # Check if there are at least two numerical values to avoid errors.
                # Calculate the confidence interval
                data = df[col].dropna()
                confidence_interval = calculate_confidence_interval(data)

                if confidence_interval:
                    lower_bound, upper_bound = confidence_interval
                    print(f"Column: {col}")
                    print(f"  95% Confidence Interval: ({lower_bound:.2f}, {upper_bound:.2f})")
                    print("-" * 20)
        print("\n")
    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
    except Exception as e:
        print(f"An error occurred while processing {file_path}: {e}")


Analyzing: /content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Gender pay gap.csv
Column: age
  95% Confidence Interval: (40.51, 42.28)
--------------------
Column: perfEval
  95% Confidence Interval: (2.95, 3.13)
--------------------
Column: seniority
  95% Confidence Interval: (2.88, 3.06)
--------------------
Column: basePay
  95% Confidence Interval: (92900.34, 96044.96)
--------------------
Column: bonus
  95% Confidence Interval: (6342.78, 6591.54)
--------------------


Analyzing: /content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/HR_Dataset.csv
Column: EmpID
  95% Confidence Interval: (10145.97, 10166.03)
--------------------
Column: MarriedID
  95% Confidence Interval: (0.34, 0.45)
--------------------
Column: MaritalStatusID
  95% Confidence Interval: (0.71, 0.92)
--------------------
Column: GenderID
  95% Confidence Interval: (0.38, 0.49)
--------------------
Column: EmpStatusID
  95% Confidence Interval: (2.19, 2.59)
--------------------
Co

In [None]:


The provided code calculates confidence intervals for numerical columns in multiple datasets.
The choice of attributes is driven by the need to estimate the range within which the true population parameter (e.g., mean) is likely to fall.
The rationale behind using confidence intervals is to provide a measure of uncertainty around sample statistics.


Rationale for Confidence Intervals:

1.  Uncertainty Quantification:  A sample statistic (e.g., the mean of a sample) is an estimate of the true population parameter. A confidence interval acknowledges that the sample statistic is not perfectly accurate and provides a range within which the true population parameter is likely to lie.

2.  Hypothesis Testing:  Confidence intervals are closely related to hypothesis testing.  If a hypothesized value of the mean falls outside the 95% confidence interval, a two-sided test at the 5% significance level would reject that value.

3.  Data Variability:  The width of the confidence interval reflects the variability in the data. A wider interval indicates more uncertainty, while a narrower interval suggests greater precision in the estimate.

4.  Sample Size:  The sample size influences the width of the confidence interval.  Larger sample sizes typically result in narrower intervals because the standard error of the mean shrinks as the sample grows.

5.  Population Distribution:  The t-based interval used here assumes that the sample mean is approximately normally distributed.  For large sample sizes, the Central Limit Theorem makes this a reasonable approximation even when the underlying population is not normal, provided its variance is finite.


Attribute Selection and Interpretation:

In this code, confidence intervals are calculated for all numerical columns in each dataset.  The specific interpretation of these intervals depends on the context of each column within its dataset. Identifier columns such as `EmpID` are numeric only by encoding, so the intervals reported for them carry no meaning and should be ignored.

1.  Gender Pay Gap Dataset:  Confidence intervals around salary (`basePay` or similar attributes) for each gender would help determine if there's a statistically significant difference in mean salaries between genders.  Non-overlapping intervals would indicate a significant difference; overlapping intervals do not by themselves rule one out, so a two-sample test is the more direct check.

2.  HR Dataset:  Confidence intervals around `Salary`, `Absences`, or other numerical features could be used to understand the uncertainty associated with the average values.  This helps in drawing conclusions about the employees in this organization.

3.  Asian American Quality of Life Dataset: Confidence intervals would provide a sense of precision for the calculated means of different quality-of-life indicators, such as income or education levels.  These intervals help us understand whether the averages observed in the sample are good representations of the actual values in the population.

4.  Provisional COVID-19 Deaths Dataset: Confidence intervals around the number of deaths would estimate the range within which the true population mean lies, taking into account the sampling variability.

In summary, by calculating confidence intervals for various numerical attributes, we not only obtain point estimates but also quantify our confidence in these estimates, enabling more robust conclusions about the datasets.
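The per-gender comparison described for the Gender Pay Gap dataset could be sketched as follows. This is an illustration under stated assumptions, not the notebook's own analysis: the helper name `mean_ci` and all salary values are hypothetical, and only the column names `gender` and `basePay` come from the dataset. It uses the same t-based interval as the code above, via `scipy.stats.t.interval`.

```python
import numpy as np
import pandas as pd
from scipy import stats

def mean_ci(values, confidence=0.95):
    """Return (mean, lower, upper) for a t-based confidence interval of the mean."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    sem = stats.sem(values)  # standard error of the mean (ddof=1)
    low, high = stats.t.interval(confidence, n - 1, loc=mean, scale=sem)
    return mean, low, high

# Hypothetical toy data: values are invented for illustration only.
toy = pd.DataFrame({
    "gender": ["Female"] * 5 + ["Male"] * 5,
    "basePay": [88000, 91000, 90000, 87000, 94000,
                97000, 99000, 101000, 96000, 102000],
})

# One interval per gender, so the two groups can be compared side by side.
for gender, pay in toy.groupby("gender")["basePay"]:
    mean, low, high = mean_ci(pay)
    print(f"{gender}: mean={mean:.0f}, 95% CI=({low:.0f}, {high:.0f})")
```

Computing the interval within each group, rather than over the whole column, is what makes the intervals usable for a between-group comparison.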




In [None]:
# prompt: Selecting at least TWO attributes appropriately to perform hypothesis testing
# perform with above data sets with their paths

import pandas as pd
from scipy import stats

# Load the datasets
file_paths = [
    '/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Gender pay gap.csv',
    '/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/HR_Dataset.csv',
    '/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Asian_American_Quality_of_Life.csv',
    '/content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Provisional_COVID-19_Deaths.csv'
]

for file_path in file_paths:
    try:
        df = pd.read_csv(file_path)
        print(f"Analyzing: {file_path}")

        # Select two numerical columns for hypothesis testing (replace with your choices)
        #  Ensure these columns are appropriate for the chosen test (e.g., normally distributed)
        #  and that there are enough samples

        numerical_cols = df.select_dtypes(include=['number']).columns
        if len(numerical_cols) >= 2:  # Check if at least two numerical columns exist
            col1 = numerical_cols[0]
            col2 = numerical_cols[1]

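            # Perform an independent samples t-test on the first two numerical columns.
            # Caveat: this treats two different columns as two samples of the same quantity,
            # which is rarely meaningful. Adapt the test to your hypothesis, e.g. compare
            # one attribute across two groups.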
            if len(df[col1].dropna()) >= 2 and len(df[col2].dropna()) >= 2: # Check for sufficient data points.
                t_statistic, p_value = stats.ttest_ind(df[col1].dropna(), df[col2].dropna())

                print(f"T-test for {col1} vs. {col2}")
                print(f"  T-statistic: {t_statistic:.3f}")
                print(f"  P-value: {p_value:.3f}")

                alpha = 0.05  # Significance level
                if p_value < alpha:
                    print("  Reject the null hypothesis: There is a statistically significant difference between the means.")
                else:
                    print("  Fail to reject the null hypothesis: There is no statistically significant difference between the means.")
            else:
                print(f"Insufficient data points in {col1} or {col2} to perform t-test")
        else:
            print(f"Not enough numerical columns in {file_path} to perform hypothesis testing.")
        print("-" * 20)
        print("\n")
    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
    except Exception as e:
        print(f"An error occurred while processing {file_path}: {e}")


Analyzing: /content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Gender pay gap.csv
T-test for age vs. perfEval
  T-statistic: 84.432
  P-value: 0.000
  Reject the null hypothesis: There is a statistically significant difference between the means.
--------------------


Analyzing: /content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/HR_Dataset.csv
T-test for EmpID vs. MarriedID
  T-statistic: 1991.648
  P-value: 0.000
  Reject the null hypothesis: There is a statistically significant difference between the means.
--------------------


Analyzing: /content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Asian_American_Quality_of_Life.csv
T-test for Survey ID vs. Age
  T-statistic: 4.565
  P-value: 0.000
  Reject the null hypothesis: There is a statistically significant difference between the means.
--------------------


Analyzing: /content/drive/MyDrive/Colab Notebooks/Mathematics for Data Science/Provisional_COVID-19_Deaths.csv
T-test for Year v

In [None]:

The provided code analyzes multiple datasets using a variety of statistical methods.  The selection of attributes and statistical tests depends heavily on the nature of the data within each dataset and the research questions being addressed. The rationale behind the attribute and statistical test selections is discussed below.

**Descriptive Statistics (Mean, Median, Mode, Range, Variance, Standard Deviation):**

These measures provide a basic understanding of the data's central tendency and dispersion. The choice of attributes is generally straightforward, as they apply to numerical columns within each dataset. The rationale behind each statistic:

* **Mean:** Represents the average value and gives a sense of the typical value, but it is sensitive to outliers.
* **Median:** The middle value when the data is sorted.  More robust to outliers than the mean.
* **Mode:** Identifies the most frequent value; particularly helpful for categorical or discrete data.
* **Range:**  A simple measure of spread: the difference between the maximum and minimum values. It is highly sensitive to outliers.
* **Variance & Standard Deviation:**  Quantify the spread of data around the mean. Standard deviation is more interpretable because it is expressed in the original data units.

**Confidence Intervals:**

Confidence intervals provide a range of values within which the true population parameter (like the mean) likely lies, given the sample data.  The width of the interval reflects uncertainty.  Attributes for confidence intervals are again numerical variables.  A key rationale is to understand the uncertainty associated with sample statistics.

**Hypothesis Testing (T-test):**

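Hypothesis testing assesses whether an observed difference between the means of two groups is larger than chance alone would plausibly produce.  In this case, the code performs an independent samples t-test with `stats.ttest_ind`.  The test assumes that the two samples are independent, that each is approximately normally distributed (or large enough for the sample means to be approximately normal), and, in its default form, that the two groups have equal variances.  Attribute selection hinges on identifying one numerical attribute measured in two groups that are suitable for comparison (e.g., income differences between two demographic groups).  Because the code simply takes the first two numerical columns of each file, the tests above compare unrelated quantities such as `age` vs. `perfEval` or `EmpID` vs. `MarriedID`, so the reported rejections say nothing useful about the data.  Failing to meet the test's assumptions could also lead to inaccurate results.  The p-value is the probability of observing a difference at least this large if the null hypothesis of equal means were true; a value below the chosen significance level (0.05 here) leads to rejecting the null hypothesis.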

**Overall Rationale:**

1.  **Exploratory Data Analysis:** Descriptive statistics (mean, median, mode, range, variance, standard deviation) and confidence intervals initially help to understand the data’s central tendency, dispersion, and uncertainty.

2.  **Comparative Analysis:** Hypothesis testing allows comparison between two groups and assesses whether any differences are statistically significant.

3.  **Dataset Specificity:** Appropriate attribute selection is paramount.  For instance, using the t-test on non-normally distributed data or on data that violate the other assumptions of the test would lead to incorrect conclusions.  Appropriate checks and alternative tests are needed if these assumptions are not met.  Careful data exploration should precede hypothesis tests to understand the data distribution and confirm or reject these assumptions.
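A more meaningful test than comparing two unrelated columns is to compare one attribute between two groups. The sketch below does this for `basePay` split by `gender`, using invented values rather than the real file. It swaps in Welch's t-test (`equal_var=False` in `scipy.stats.ttest_ind`), a variant that does not assume equal group variances; that choice is an assumption made for this illustration, not something the notebook specifies.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: values are invented for illustration only.
toy = pd.DataFrame({
    "gender": ["Female"] * 5 + ["Male"] * 5,
    "basePay": [88000, 91000, 90000, 87000, 94000,
                97000, 99000, 101000, 96000, 102000],
})

# Split ONE numerical attribute into two groups defined by a categorical column.
female = toy.loc[toy["gender"] == "Female", "basePay"]
male = toy.loc[toy["gender"] == "Male", "basePay"]

# Welch's t-test: independent samples, unequal variances allowed.
t_stat, p_value = stats.ttest_ind(female, male, equal_var=False)

alpha = 0.05  # significance level
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis of equal mean basePay.")
else:
    print("Fail to reject the null hypothesis of equal mean basePay.")
```

With real data, the same pattern applies after loading the CSV; checking group sizes and distributions first helps decide whether a t-test or a non-parametric alternative is more appropriate.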
