# Understanding the Data

Y -> There are 2 targets
 1. ADHD diagnosis (0 - None, 1 - ADHD)
 2. Sex (0 - Male, 1 - Female)

X ->
1. Metadata A + B = Socio-Demographic Information (Quantitative + Categorical)
2. Functional MRI Matrices (Don't Touch)

Combine these two feature sets for EDA and model training.
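
A minimal sketch of that combination step, assuming every table shares one identifier column. The column names (`participant_id`, `handedness_score`, `conn_0_1`, and so on) and the toy values are hypothetical stand-ins, not the real schema.

```python
import pandas as pd

# Toy stand-ins for the real files; every column name here is hypothetical.
labels = pd.DataFrame({
    "participant_id": ["p1", "p2", "p3"],
    "ADHD": [1, 0, 1],   # 0 = none, 1 = ADHD
    "Sex": [0, 1, 1],    # 0 = male, 1 = female
})
metadata = pd.DataFrame({
    "participant_id": ["p1", "p2", "p3"],
    "handedness_score": [80.0, -40.0, 100.0],
    "parent_education": ["college", "high_school", "college"],
})
fmri = pd.DataFrame({
    "participant_id": ["p1", "p2", "p3"],
    "conn_0_1": [0.12, -0.05, 0.33],
    "conn_0_2": [0.40, 0.18, -0.21],
})

# Inner-join on the shared ID; validate= raises if an ID repeats in either table.
features = metadata.merge(fmri, on="participant_id", how="inner", validate="one_to_one")
full = features.merge(labels, on="participant_id", how="inner", validate="one_to_one")

print(full.shape)
```

Joining on the ID rather than concatenating by row position guards against the files being sorted differently.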

---

### Data Files

- LABELS: ADHD diagnosis and sex.
- fMRI Connectome Matrices: High-dimensional numerical data representing brain connectivity.
- Socio-Demographic Data: A mix of categorical and numerical features (e.g., handedness, education level, parenting info, emotions).

---

### Other Files

- Instrument_Description – Describes the tests (instruments) that were administered.
- EHQ – Edinburgh Handedness Questionnaire
- APQ – Alabama Parenting Questionnaire
- SDQ – Strengths and Difficulties Questionnaire

These files appear to be background reference material; they may help with feature development and with interpreting the variables.

---

# Exploratory Data Analysis (EDA)

✅ Missing data should be handled before checking relationships/correlations.

✅ Normalization/standardization should be done after checking relationships/correlations, so that distributions and outliers are inspected in their original units.

### 1️⃣ Deeper Understanding of Data

✅ Steps:

- Load the dataset (CSV, Excel, SQL, JSON, etc.).
- Check the data structure: Number of rows & columns (df.shape).
- Inspect column names & data types (df.info()).
- Check the first few rows (df.head()).
- Look at variable descriptions (refer to data dictionary).
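
The steps above can be sketched as follows. The inline CSV text and its column names are made up so the snippet runs on its own; with a real file the call would be `pd.read_csv` on that file's path.

```python
import io
import pandas as pd

# Inline CSV standing in for a real file (contents and column names are hypothetical).
csv_text = """participant_id,age,handedness
p1,9.5,right
p2,11.0,left
p3,,right
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (rows, columns)
df.info()          # column names, dtypes, non-null counts
print(df.head())   # first few rows
```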

### 2️⃣ Data Cleaning & Handling Missing Values

💡 Goal: Fix inconsistencies, deal with NaNs, and clean the data

✅ Steps:

- Identify missing values (df.isnull().sum()).
- Decide how to handle NaNs:
  - Drop rows (df.dropna()).
  - Fill numeric columns with mean/median (df.fillna(df.mean(numeric_only=True))).
  - Fill categorical NaNs with mode (df.fillna(df.mode().iloc[0])).

- Remove duplicates (df.duplicated().sum() → df.drop_duplicates()).
- Fix inconsistencies (e.g., standardize categorical values: "Male"/"male" → "Male").
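
A small sketch of these cleaning steps on a toy frame. The columns `age` and `sex` are illustrative, and median/mode imputation is only one of the options listed above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [9.0, 11.0, np.nan, 11.0, 14.0],
    "sex": ["Male", "male", "Female", "male", None],
})

print(df.isnull().sum())   # NaN count per column

# Standardize category spellings first so "Male" and "male" count as one value.
df["sex"] = df["sex"].str.strip().str.capitalize()

# Numeric columns: fill with the median. Categorical columns: fill with the mode.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

print(df.duplicated().sum())   # number of exact duplicate rows
df = df.drop_duplicates().reset_index(drop=True)
```

Note that `df.mode()` returns a DataFrame, so `.iloc[0]` is needed to get one fill value per column.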

### 3️⃣ Feature Understanding & Exploration

💡 Goal: Analyze the features, distributions, and relationships

✅ Steps:

#### A. Univariate Analysis (Individual Feature Analysis)

📊 For numerical variables:

- Summary statistics (df.describe()).
- Histograms (sns.histplot(df['column'])).
- Box plots (to check for outliers).

📊 For categorical variables:

- Value counts (df['column'].value_counts()).
- Bar plots (sns.countplot(x=df['column'])).
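
The numeric side of the univariate checks can be sketched without any plotting. The data is made up, and the IQR rule shown is the one a box plot uses to decide which points fall beyond the whiskers.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [8, 9, 9, 10, 11, 12, 30],
    "handedness": ["right", "right", "left", "right", "right", "left", "right"],
})

summary = df["age"].describe()                            # count, mean, std, min, quartiles, max
counts = df["handedness"].value_counts()                  # absolute frequencies
shares = df["handedness"].value_counts(normalize=True)    # relative frequencies

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
flagged = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(summary, counts, shares, flagged, sep="\n\n")
```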

#### B. Bivariate Analysis (Feature-to-Feature Relationships)

📌 Numerical vs. Numerical:

- Correlation matrix (df.corr(numeric_only=True), sns.heatmap(df.corr(numeric_only=True))).
- Scatter plots (sns.scatterplot(x='feature1', y='feature2', data=df)).

📌 Categorical vs. Numerical:

- Boxplots (sns.boxplot(x='category', y='numerical', data=df)).
- Groupby statistics (df.groupby('category')['numerical'].mean()).

📌 Categorical vs. Categorical:

- Cross-tabulation (pd.crosstab(df['cat1'], df['cat2'])).
- Stacked bar charts.
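
A sketch of the three pairings in tabular form, on made-up data. The column names (`sdq_total`, `adhd`, and so on) are hypothetical, and the plots listed above would be the visual counterparts of these tables.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [8, 9, 10, 11, 12, 13],
    "sdq_total": [20, 18, 15, 14, 11, 9],
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "adhd": [1, 1, 1, 0, 0, 0],
})

# Numerical vs. numerical: Pearson correlation over numeric columns only.
corr = df.corr(numeric_only=True)

# Categorical vs. numerical: mean of a numeric column per category.
group_means = df.groupby("sex")["sdq_total"].mean()

# Categorical vs. categorical: counts, then row-wise shares.
table = pd.crosstab(df["sex"], df["adhd"])
row_shares = pd.crosstab(df["sex"], df["adhd"], normalize="index")

print(corr, group_means, table, row_shares, sep="\n\n")
```

Passing `numeric_only=True` keeps `df.corr()` from failing on string columns in recent pandas versions.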

### 4️⃣ Handling Outliers & Transformations

💡 Goal: Detect and manage extreme values for better model performance

✅ Steps:

- Detect outliers using box plots or Z-score (scipy.stats.zscore()).
- Remove extreme values (df[df['column'] < threshold]) or cap them (df['column'].clip(upper=threshold)).
- Apply transformations if needed:
  - Log transformation for skewed data.
  - Scaling (MinMaxScaler, StandardScaler).
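
A sketch of detection, capping, and a log transform on a toy column. The z-score is computed by hand here, which should match what `scipy.stats.zscore` returns with its default settings; the cutoff of 3 and the IQR-based cap are common conventions, not values taken from this dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [10.0, 12, 11, 13, 12, 11, 10, 12,
                             13, 11, 12, 11, 10, 13, 12, 95]})

# Z-score: distance from the mean in (population) standard deviations.
z = (df["score"] - df["score"].mean()) / df["score"].std(ddof=0)
outliers = df[z.abs() > 3]

# Option 1: remove the flagged rows.
removed = df[z.abs() <= 3]

# Option 2: cap at the upper IQR fence instead of dropping rows.
q1, q3 = df["score"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
capped = df["score"].clip(upper=upper)

# Log transform for right-skewed, non-negative data; log1p also handles zeros.
logged = np.log1p(df["score"])

print(outliers, capped.max(), logged.round(2).tolist(), sep="\n")
```

Z-scores are unreliable on very small samples because a single extreme value inflates the standard deviation it is measured against, which is why the toy column has more than a handful of rows.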

### 5️⃣ Feature Engineering & Encoding

💡 Goal: Transform data into a format suitable for machine learning

✅ Steps:

📌 Encoding categorical variables

- One-hot encoding (pd.get_dummies(df, columns=['categorical_column'])).
- Label encoding (sklearn.preprocessing.LabelEncoder() for targets; sklearn.preprocessing.OrdinalEncoder() for ordered feature columns).

📌 Feature Engineering

- Create new features (e.g., age groups from age).
- Extract date components (year, month, etc.).
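
A sketch of one-hot encoding plus the two feature-engineering ideas above. The column names, the age cut points, and the `scan_date` field are all hypothetical illustrations.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [6, 9, 13, 17],
    "handedness": ["right", "left", "right", "ambidextrous"],
    "scan_date": ["2021-03-15", "2021-11-02", "2022-01-20", "2022-07-08"],
})

# One-hot encode; dtype=int gives 0/1 columns instead of booleans.
encoded = pd.get_dummies(df, columns=["handedness"], dtype=int)

# Age groups via binning; intervals are (0, 8], (8, 12], (12, 18].
encoded["age_group"] = pd.cut(
    df["age"], bins=[0, 8, 12, 18], labels=["child", "preteen", "teen"]
)

# Date components
dates = pd.to_datetime(df["scan_date"])
encoded["scan_year"] = dates.dt.year
encoded["scan_month"] = dates.dt.month

print(encoded)
```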

📌 Dimensionality Reduction (if needed)

- PCA (Principal Component Analysis) for high-dimensional data.
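
A sketch of PCA on a synthetic wide matrix (more columns than rows). The random data and the 90% variance target are assumptions for illustration; in a real pipeline the scaler and PCA would be fit on the training split only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))   # 50 rows x 200 synthetic features

# PCA is scale-sensitive, so standardize the columns first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps just enough components to reach that share of variance.
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```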

### 6️⃣ Splitting Data for Training & Model Preparation

💡 Goal: Prepare dataset for modeling

✅ Steps:

- Separate features and target (X = df.drop(columns=['target']), y = df['target']).
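- Train-test split (train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)); stratify keeps the class ratio the same in both splits.
- Check class distribution (for imbalanced data).
- Balance data if necessary (SMOTE, oversampling), on the training split only so the test set keeps the real class distribution.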
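
A sketch of the split and a balancing step on synthetic data. Plain random oversampling with `sklearn.utils.resample` is used here as a simpler stand-in for SMOTE, and the feature names and 70/30 class ratio are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "target": [1] * 70 + [0] * 30,   # imbalanced toy target
})

X = df.drop(columns=["target"])
y = df["target"]

# stratify=y preserves the 70/30 ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.value_counts(normalize=True))

# Oversample the minority class in the training split only.
train = X_train.assign(target=y_train)
majority = train[train["target"] == 1]
minority = train[train["target"] == 0]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_up])
print(train_balanced["target"].value_counts())
```

With two targets (ADHD and sex), the same pattern would be repeated per target, or the stratification key could combine both labels.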