<a href="https://colab.research.google.com/github/Annettteee/annette-colab-projects/blob/main/DNAStrand.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

def lead_data(data_path, target_column, test_size=0.2, random_state=42):
    """
    Loads, preprocesses, and splits data for machine learning.

    Args:
      data_path: Path to the CSV file containing the data.
      target_column: Name of the column containing the target variable.
      test_size: Proportion of the dataset to include in the test split.
      random_state: Seed for random number generation for reproducibility.

    Returns:
      A dictionary containing the following:
        - X_train: Training features.
        - X_test: Testing features.
        - y_train: Training target.
        - y_test: Testing target.
        - scaler: Fitted StandardScaler for future scaling of new data.
        - label_encoder: Fitted LabelEncoder for the target variable (if applicable).

    """

    try:
      # 1. Load the data
        df = pd.read_csv(data_path)
    except FileNotFoundError:
        print(f"Error: File not found at {data_path}")
        return None
    except pd.errors.EmptyDataError:
        print(f"Error: Empty data file at {data_path}")
        return None
    except pd.errors.ParserError:
        print(f"Error: Parsing error. Check the file format at {data_path}")
        return None

    # 2. Handle missing values (example: fill with mean)
    for col in df.columns:
        if df[col].isnull().any():
          if pd.api.types.is_numeric_dtype(df[col]):
              df[col].fillna(df[col].mean(), inplace=True)
          else:
              df[col].fillna(df[col].mode()[0], inplace=True)  # Fill with mode for categorical

    # 3. Separate features (X) and target (y)
    X = df.drop(target_column, axis=1)
    y = df[target_column]

    # 4. Encode categorical features (if any)
    for col in X.columns:
      if not pd.api.types.is_numeric_dtype(X[col]):
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col])

    # 5. Encode the target variable (if categorical)
    if not pd.api.types.is_numeric_dtype(y):
      label_encoder = LabelEncoder()
      y = label_encoder.fit_transform(y)
    else:
      label_encoder = None

    # 6. Scale numerical features
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # 7. Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    return {
        "X_train": X_train,
        "X_test": X_test,
        "y_train": y_train,
        "y_test": y_test,
        "scaler": scaler,
        "label_encoder": label_encoder
    }

data = lead_data("your_data.csv", "your_target_column")

if data:
    X_train = data["X_train"]
    X_test = data["X_test"]
    y_train = data["y_train"]
    y_test = data["y_test"]

    print("Data loaded, preprocessed, and split successfully.")
    print(f"X_train shape: {X_train.shape}")
    print(f"X_test shape: {X_test.shape}")
    print(f"y_train shape: {y_train.shape}")
    print(f"y_test shape: {y_test.shape}")


Error: File not found at your_data.csv


In [14]:
import pandas as pd

try:
    # Use pd.read_csv to read CSV files
    df = pd.read_csv("/content/amtdb_metadata.csv")
    df.head()
except FileNotFoundError:
    print("Error: The file 'amtdb_metadata.csv' was not found.")
    print("Please upload the 'amtdb_metadata.csv' file to the '/content/' directory in your Colab environment.")
    raise # Re-raise the exception to stop further execution if the file is missing.
except ValueError as e:
    print(f"Error reading file: {e}")
    print("It seems the file format might be incorrect for `pd.read_csv`. Please ensure it's a valid CSV.")
    raise

In [18]:
import pandas as pd
try:
    df = pd.read_csv("/content/amtdb_metadata.csv")
except FileNotFoundError:
    print("Error: The file 'amtdb_metadata.csv' was not found.")
    print("Please upload the 'amtdb_metadata.csv' file to the '/content/' directory in your Colab environment.")
    raise # Re-raise the exception to stop further execution in this cell if the file is missing.

print("Available columns in DataFrame: ", df.columns.tolist())

Available columns in DataFrame:  ['identifier', 'alternative_identifiers', 'country', 'continent', 'region', 'culture', 'epoch', 'group', 'comment', 'latitude', 'longitude', 'sex', 'site', 'site_detail', 'mt_hg', 'ychr_hg', 'year_from', 'year_to', 'date_detail', 'bp', 'c14_lab_code', 'reference_name', 'reference_link', 'data_link', 'c14_sample_tag', 'c14_layer_tag', 'ychr_snps', 'avg_coverage', 'sequence_source', 'mitopatho_alleles', 'mitopatho_positions', 'mitopatho_locus', 'mitopatho_diseases', 'mitopatho_statuses', 'mitopatho_homoplasms', 'mitopatho_heteroplasms']


In [25]:
# This cell previously contained code for analyzing a different dataset.
# The columns 'Average Time', 'Number of Splices', and 'DNA Strand Class'
# are not present in the 'amtdb_metadata.csv' dataset.
# The content of this cell has been removed as it is not applicable to the current analysis.

In [55]:
import pandas as pd

# 1. Define your own values for the features
# Ensure these values are plausible within the context of your dataset
# For categorical features, try to use values that were present in the training data
# or ensure the model can handle new categories if that's a requirement.
new_data_input = pd.DataFrame({
    'country': ['Germany'],
    'continent': ['Europe'],
    'epoch': ['Iron Age'],
    'sex': ['M'],
    'latitude': [51.5],
    'longitude': [10.0]
})

# 2. Apply the same one-hot encoding as done for the training data
# The `categorical_features` list was defined earlier in the notebook.
# We need to ensure that the columns of the new data match the columns of X_train.

# One-hot encode the new data's categorical features
new_data_categorical = pd.get_dummies(new_data_input[categorical_features], drop_first=True)

# Combine with numerical features
new_data_processed = pd.concat([new_data_input[['latitude', 'longitude']], new_data_categorical], axis=1)

# Reindex the new data to align columns with the training data (X)
# This is crucial to ensure the order and presence of columns are identical
# If a new category is introduced in new_data_input, it will be handled by reindex and filled with 0
new_data_aligned = new_data_processed.reindex(columns=X.columns, fill_value=0)

# 3. Make a prediction using the trained model
new_prediction_encoded = model.predict(new_data_aligned)

# 4. Decode the prediction back to the original haplogroup string
# `label_encoder` was fitted on the entire range of mt_hg values earlier
new_prediction_haplogroup = label_encoder.inverse_transform(new_prediction_encoded)

print(f"\nInput Data: ")
display(new_data_input)
print(f"\nPredicted Mitochondrial Haplogroup: {new_prediction_haplogroup[0]}")



Input Data: 


Unnamed: 0,country,continent,epoch,sex,latitude,longitude
0,Germany,Europe,Iron Age,M,51.5,10.0



Predicted Mitochondrial Haplogroup: U3a2a


In [19]:
# Display the first 5 rows of the DataFrame
print("\nFirst 5 rows of the DataFrame:")
display(df.head())


First 5 rows of the DataFrame:


Unnamed: 0,identifier,alternative_identifiers,country,continent,region,culture,epoch,group,comment,latitude,...,ychr_snps,avg_coverage,sequence_source,mitopatho_alleles,mitopatho_positions,mitopatho_locus,mitopatho_diseases,mitopatho_statuses,mitopatho_homoplasms,mitopatho_heteroplasms
0,I1496,Motala1,Hungary,Europe,Pannonia,Linear Pottery,Neolithic,LBK,,47.169998,...,,0.0,bam,12372A;16093C;11467G;16192T;12308G;5460A;16270T,12372;16093;11467;16192;12308;5460;16270,MT-ND5;MT-CR;MT-ND4;MT-CR;MT-TL2;MT-ND2;MT-CR,Altered brain pH / sCJD patients;Cyclic Vomiti...,Reported;Reported;Reported;Reported;Reported;C...,+;-;+;nr;+;+;nr,-;+;-;nr;+;+;nr
1,I1707,Motala2,Jordan,Asia,Near East,PPNB,Neolithic,PPNE,,31.99,...,I2c:FGC18140:28707130A->G;I2c:L596:14197631G->...,20.299999,bam,11467G;12308G;12372A;16189C,11467;12308;12372;16189,MT-ND4;MT-TL2;MT-ND5;MT-CR,Altered brain pH / sCJD patients;CPEO / Stroke...,Reported;Reported;Reported;Reported,+;+;+;+,-;+;-;-
2,I0708,Motala3,Turkey,Asia,Anatolia,Anatolia_Neolithic,Neolithic,NENE,,40.299999,...,I2a1b:M423:19096091G->A;I2a1:P37.2:14491684T->...,0.0,bam,12308G;12372A;11467G;16192T;5460A;16270T,12308;12372;11467;16192;5460;16270,MT-TL2;MT-ND5;MT-ND4;MT-CR;MT-ND2;MT-CR,CPEO / Stroke / CM / Breast & Renal & Prostate...,Reported;Reported;Reported;Reported;Conflictin...,+;+;+;nr;+;nr,+;-;-;nr;+;nr
3,I1710,Motala4,Jordan,Asia,Near East,PPNB,Neolithic,PPNE,,31.99,...,,48.5,bam,12308G;12372A;11467G;16192T;16270T,12308;12372;11467;16192;16270,MT-TL2;MT-ND5;MT-ND4;MT-CR;MT-CR,CPEO / Stroke / CM / Breast & Renal & Prostate...,Reported;Reported;Reported;Reported;Reported,+;+;+;nr;nr,+;-;-;nr;nr
4,I0746,Motala6,Turkey,Asia,Anatolia,Anatolia_Neolithic,Neolithic,NENE,,40.299999,...,I2a1:P37.2:14491684T->C;I2a:L460:7879415A->C;I...,0.0,bam,12308G;12372A;11467G;6480A;16192T;310C;16270T,12308;12372;11467;6480;16192;310;16270,MT-TL2;MT-ND5;MT-ND4;MT-CO1;MT-CR;MT-CR;MT-CR,CPEO / Stroke / CM / Breast & Renal & Prostate...,Reported;Reported;Reported;Reported;Reported;R...,+;+;+;+;nr;<NA>;nr,+;-;-;-;nr;<NA>;nr


In [20]:
# Get a concise summary of the DataFrame, including data types and non-null values
print("\nDataFrame Info:")
df.info()


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 546 entries, 0 to 545
Data columns (total 36 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   identifier               546 non-null    object 
 1   alternative_identifiers  546 non-null    object 
 2   country                  546 non-null    object 
 3   continent                546 non-null    object 
 4   region                   546 non-null    object 
 5   culture                  546 non-null    object 
 6   epoch                    546 non-null    object 
 7   group                    546 non-null    object 
 8   comment                  190 non-null    object 
 9   latitude                 546 non-null    float64
 10  longitude                546 non-null    float64
 11  sex                      546 non-null    object 
 12  site                     546 non-null    object 
 13  site_detail              32 non-null     object 
 14  mt_hg    

In [21]:
# Check for missing values in each column
print("\nMissing values per column:")
display(df.isnull().sum())


Missing values per column:


Unnamed: 0,0
identifier,0
alternative_identifiers,0
country,0
continent,0
region,0
culture,0
epoch,0
group,0
comment,356
latitude,0


In [22]:
# Generate descriptive statistics for numerical columns
print("\nDescriptive statistics for numerical columns:")
display(df.describe())


Descriptive statistics for numerical columns:


Unnamed: 0,latitude,longitude,year_from,year_to,c14_sample_tag,c14_layer_tag,avg_coverage
count,546.0,546.0,546.0,546.0,546.0,546.0,546.0
mean,43.121721,10.98302,-2895.78022,-2546.415751,0.410256,0.058608,145.10483
std,6.177783,17.138292,2491.539732,2324.555243,0.492331,0.235105,236.310373
min,31.790001,-9.3,-11840.0,-11110.0,0.0,0.0,0.0
25%,38.702499,-2.565,-4287.5,-3913.5,0.0,0.0,0.0
50%,41.981377,3.1083,-2900.0,-2500.0,0.0,0.0,38.175001
75%,48.415,19.66423,-1612.0,-1200.0,1.0,0.0,178.0
max,61.650002,58.18,6010.0,1600.0,1.0,1.0,1554.91687


In [23]:
# Generate descriptive statistics for categorical columns
print("\nDescriptive statistics for categorical columns:")
display(df.describe(include='object'))


Descriptive statistics for categorical columns:


Unnamed: 0,identifier,alternative_identifiers,country,continent,region,culture,epoch,group,comment,sex,...,data_link,ychr_snps,sequence_source,mitopatho_alleles,mitopatho_positions,mitopatho_locus,mitopatho_diseases,mitopatho_statuses,mitopatho_homoplasms,mitopatho_heteroplasms
count,546,546,546,546,546,546,546,546,190,546,...,546,141,546,122,122,122,122,122,115,115
unique,546,544,13,2,11,81,8,37,36,3,...,7,135,1,113,113,112,112,62,99,108
top,I4558,SVP5,Spain,Europe,Iberia,Linear Pottery,Neolithic,CAIB,Middle-Late Neolithic,M,...,https://www.ebi.ac.uk/ena/browser/view/PRJEB30874,G2a2a1a2a:PF3237:17017831G->A;G2a2a1a2a:PF3238...,bam,310C,310,MT-CR,Possible protective factor for normal tension ...,Reported,+;+;+;+;+;+,+;-;-;nr;nr
freq,1,2,279,476,299,43,184,76,43,287,...,263,2,546,7,7,8,7,9,6,2


# Task
Task: Perform data cleaning by addressing missing values in the `mt_hg`, `ychr_hg`, `mitopatho_alleles`, and `mitopatho_diseases` columns of the DataFrame. Specifically, fill missing values in `mt_hg` and `ychr_hg` with 'Unknown', and fill missing values in `mitopatho_alleles` and `mitopatho_diseases` with 'None'.

## Data Preparation and Cleaning

### Subtask:
Address missing values in critical columns such as haplogroups (`mt_hg`, `ychr_hg`), `mitopatho_alleles`, and `mitopatho_diseases`.


**Reasoning**:
The subtask requires addressing missing values in critical columns by filling them with specific string values. I will use the `fillna()` method for each specified column.



In [26]:
df['mt_hg'].fillna('Unknown', inplace=True)
df['ychr_hg'].fillna('Unknown', inplace=True)
df['mitopatho_alleles'].fillna('None', inplace=True)
df['mitopatho_diseases'].fillna('None', inplace=True)

print("Missing values in 'mt_hg' filled with 'Unknown'.")
print("Missing values in 'ychr_hg' filled with 'Unknown'.")
print("Missing values in 'mitopatho_alleles' filled with 'None'.")
print("Missing values in 'mitopatho_diseases' filled with 'None'.")

Missing values in 'mt_hg' filled with 'Unknown'.
Missing values in 'ychr_hg' filled with 'Unknown'.
Missing values in 'mitopatho_alleles' filled with 'None'.
Missing values in 'mitopatho_diseases' filled with 'None'.


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['mt_hg'].fillna('Unknown', inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['ychr_hg'].fillna('Unknown', inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values alw

**Reasoning**:
The previous code produced a FutureWarning related to `inplace=True` with chained assignment. To address this, I will explicitly reassign the result of the `fillna` operation back to the DataFrame column, which ensures the changes are applied correctly and avoids the warning.



In [27]:
df['mt_hg'] = df['mt_hg'].fillna('Unknown')
df['ychr_hg'] = df['ychr_hg'].fillna('Unknown')
df['mitopatho_alleles'] = df['mitopatho_alleles'].fillna('None')
df['mitopatho_diseases'] = df['mitopatho_diseases'].fillna('None')

print("Missing values in 'mt_hg' filled with 'Unknown'.")
print("Missing values in 'ychr_hg' filled with 'Unknown'.")
print("Missing values in 'mitopatho_alleles' filled with 'None'.")
print("Missing values in 'mitopatho_diseases' filled with 'None'.")

Missing values in 'mt_hg' filled with 'Unknown'.
Missing values in 'ychr_hg' filled with 'Unknown'.
Missing values in 'mitopatho_alleles' filled with 'None'.
Missing values in 'mitopatho_diseases' filled with 'None'.


## Parse `mitopatho_diseases` column

### Subtask:
Parse the `mitopatho_diseases` column to extract individual disease entries for analysis.


**Reasoning**:
To understand how to parse the `mitopatho_diseases` column, I need to inspect its unique values and format. This will help identify the delimiters and patterns for splitting the disease entries.



In [28]:
print("Unique values in 'mitopatho_diseases' column:")
print(df['mitopatho_diseases'].unique())
print("\nValue counts for 'mitopatho_diseases' column (top 20 if many unique values):")
print(df['mitopatho_diseases'].value_counts().head(20))

Unique values in 'mitopatho_diseases' column:
['Altered brain pH / sCJD patients;Cyclic Vomiting Syndrome;Altered brain pH / sCJD patients;Melanoma patients;CPEO / Stroke / CM / Breast & Renal & Prostate Cancer Risk / Altered brain pH /sCJD;AD / PD / LHON;Melanoma patients'
 'Altered brain pH / sCJD patients;CPEO / Stroke / CM / Breast & Renal & Prostate Cancer Risk / Altered brain pH /sCJD;Altered brain pH / sCJD patients;Diabetes / Cardiomyopathy /cancer risk / mtDNA copy nbr / Metabolic Syndrome / Melanoma patients'
 'CPEO / Stroke / CM / Breast & Renal & Prostate Cancer Risk / Altered brain pH /sCJD;Altered brain pH / sCJD patients;Altered brain pH / sCJD patients;Melanoma patients;AD / PD / LHON;Melanoma patients'
 'CPEO / Stroke / CM / Breast & Renal & Prostate Cancer Risk / Altered brain pH /sCJD;Altered brain pH / sCJD patients;Altered brain pH / sCJD patients;Melanoma patients;Melanoma patients'
 'CPEO / Stroke / CM / Breast & Renal & Prostate Cancer Risk / Altered brain pH /s

**Reasoning**:
Based on the inspection of unique values, the `mitopatho_diseases` column contains multiple disease entries separated by semicolons. I will split these strings into lists of individual diseases, handling 'None' entries by converting them to empty lists, and store the result in a new column called `individual_diseases`.



In [29]:
df['individual_diseases'] = df['mitopatho_diseases'].apply(lambda x: [disease.strip() for disease in x.split(';') if disease.strip() and disease.strip() != 'None'])

print("New column 'individual_diseases' created with parsed disease entries.")
print("First 5 entries of 'individual_diseases' column:")
print(df['individual_diseases'].head())

New column 'individual_diseases' created with parsed disease entries.
First 5 entries of 'individual_diseases' column:
0    [Altered brain pH / sCJD patients, Cyclic Vomi...
1    [Altered brain pH / sCJD patients, CPEO / Stro...
2    [CPEO / Stroke / CM / Breast & Renal & Prostat...
3    [CPEO / Stroke / CM / Breast & Renal & Prostat...
4    [CPEO / Stroke / CM / Breast & Renal & Prostat...
Name: individual_diseases, dtype: object


## Drop columns with excessive missingness

### Subtask:
Identify and remove columns from the DataFrame that have a high percentage of missing values.


**Reasoning**:
To identify columns with excessive missingness, I will calculate the percentage of missing values for each column.



In [30]:
missing_percentage = df.isnull().sum() / len(df) * 100
drop_columns = missing_percentage[missing_percentage > 70].index.tolist()

print("Columns to be dropped due to >70% missing values:")
if drop_columns:
    for col in drop_columns:
        print(f"- {col}: {missing_percentage[col]:.2f}% missing")
    df.drop(columns=drop_columns, inplace=True)
    print("\nColumns dropped successfully.")
else:
    print("No columns found with more than 70% missing values.")

print(f"\nShape of the DataFrame after dropping columns: {df.shape}")

print("\nDataFrame Info after dropping columns:")
df.info()

Columns to be dropped due to >70% missing values:
- site_detail: 94.14% missing
- c14_lab_code: 75.27% missing
- ychr_snps: 74.18% missing
- mitopatho_positions: 77.66% missing
- mitopatho_locus: 77.66% missing
- mitopatho_statuses: 77.66% missing
- mitopatho_homoplasms: 78.94% missing
- mitopatho_heteroplasms: 78.94% missing

Columns dropped successfully.

Shape of the DataFrame after dropping columns: (546, 29)

DataFrame Info after dropping columns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 546 entries, 0 to 545
Data columns (total 29 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   identifier               546 non-null    object 
 1   alternative_identifiers  546 non-null    object 
 2   country                  546 non-null    object 
 3   continent                546 non-null    object 
 4   region                   546 non-null    object 
 5   culture                  546 non-null    object 
 6   e

## Address missing values and verify data types for geographical data

### Subtask:
Address missing values in the geographical columns (`latitude`, `longitude`) and verify their data types.


**Reasoning**:
I need to check if there are any missing values in the `latitude` and `longitude` columns and display their current data types to determine if any imputation or type conversion is required. This will help me assess the current state of these columns before proceeding with any data cleaning steps.



In [31]:
print("Missing values in 'latitude':", df['latitude'].isnull().sum())
print("Missing values in 'longitude':", df['longitude'].isnull().sum())
print("Data type of 'latitude':", df['latitude'].dtype)
print("Data type of 'longitude':", df['longitude'].dtype)

Missing values in 'latitude': 0
Missing values in 'longitude': 0
Data type of 'latitude': float64
Data type of 'longitude': float64


## Examine 'comment' and 'bp' columns

### Subtask:
Examine the 'comment' and 'bp' columns to understand their content and determine the best strategy for handling their missing values.


**Reasoning**:
To understand the content and distribution of values in the 'comment' and 'bp' columns, I will print the number of unique values, the value counts, and some non-null sample entries for each.



In [32]:
print("Number of unique values in 'comment':", df['comment'].nunique())
print("\nValue counts for 'comment' column (top 20 if many unique values):\n", df['comment'].value_counts().head(20))
print("\nNon-null sample values from 'comment' column:\n", df['comment'].dropna().sample(min(5, df['comment'].dropna().shape[0]), random_state=42).tolist())

print("\nNumber of unique values in 'bp':", df['bp'].nunique())
print("\nValue counts for 'bp' column (top 20 if many unique values):\n", df['bp'].value_counts().head(20))
print("\nNon-null sample values from 'bp' column:\n", df['bp'].dropna().sample(min(5, df['bp'].dropna().shape[0]), random_state=42).tolist())

Number of unique values in 'comment': 36

Value counts for 'comment' column (top 20 if many unique values):
 comment
Middle-Late Neolithic          43
SE_Iberia_c.10-16CE            26
E_Iberia_IA                    17
SE_Iberia_c.5-8CE              13
NE_Iberia_c.6-8CE_ES           10
NE_Iberia_Greek (Empúries1)     9
NE_Iberia_c.6CE_PL              8
NE_Iberia_c.8-12CE              7
C_Iberia_BA                     6
steppe ancestry                 4
NE_Iberia_RomP                  4
ALPc_MN                         4
SW_Iberia_IA                    4
NE_Iberia_RomP_out              3
N_Iberia_IA                     3
SE_Iberia_c.3-4CE               3
NE_Iberia_RomP (Empúries1)      3
Srubnaya_Outlier                2
NE_Iberia_RomP (Empúries2)      2
NE_Iberia_Hel (Empúries1)       2
Name: count, dtype: int64

Non-null sample values from 'comment' column:
 ['SE_Iberia_c.10-16CE', 'SE_Iberia_c.10-16CE', 'African ancestry', 'SE_Iberia_c.10-16CE', 'NE_Iberia_c.6-8CE_ES']

Number of uniq

**Reasoning**:
Based on the examination of the 'comment' and 'bp' columns, both contain textual information and have a significant number of missing values. To handle these missing values appropriately without losing potential information, I will fill them with descriptive placeholders: 'No Comment' for the 'comment' column and 'Unknown BP' for the 'bp' column.



In [33]:
df['comment'] = df['comment'].fillna('No Comment')
df['bp'] = df['bp'].fillna('Unknown BP')

print("Missing values in 'comment' filled with 'No Comment'.")
print("Missing values in 'bp' filled with 'Unknown BP'.")

print("\nMissing values after filling:")
print("Missing values in 'comment':", df['comment'].isnull().sum())
print("Missing values in 'bp':", df['bp'].isnull().sum())

Missing values in 'comment' filled with 'No Comment'.
Missing values in 'bp' filled with 'Unknown BP'.

Missing values after filling:
Missing values in 'comment': 0
Missing values in 'bp': 0


## Analyze haplogroup distribution by geographical categories

### Subtask:
Analyze the distribution and frequencies of mitochondrial haplogroups (mt_hg) and Y-chromosome haplogroups (ychr_hg) across different geographical categories (country, continent, region).


**Reasoning**:
To begin analyzing the haplogroup distribution, I will first calculate and display the frequency distribution of 'mt_hg' for each unique country. This will provide a foundational understanding of which mitochondrial haplogroups are prevalent in different countries.



In [34]:
mt_hg_country_distribution = df.groupby('country')['mt_hg'].value_counts(normalize=True).unstack(fill_value=0)
print("Mitochondrial Haplogroup (mt_hg) frequency distribution per Country:")
display(mt_hg_country_distribution.head())

Mitochondrial Haplogroup (mt_hg) frequency distribution per Country:


mt_hg,C1,C4a1a+195,G2a4,H,H or H5-C16192T,H+163,H+16311,H1,H1 or H1au1b,H1+16189,...,X2b,X2b+226,X2b4,X2c1,X2d1,X2d2,X2f,X2m2,X2n,no_data
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Armenia,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0
Germany,0.0,0.0,0.0,0.021978,0.0,0.0,0.010989,0.032967,0.010989,0.0,...,0.0,0.0,0.010989,0.0,0.010989,0.0,0.0,0.0,0.0,0.0
Gibraltar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Greece,0.0,0.0,0.0,0.181818,0.0,0.090909,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hungary,0.0,0.0,0.0,0.103448,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.034483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Reasoning**:
Continuing with the haplogroup distribution analysis, I will now calculate and display the frequency distribution of 'mt_hg' for each unique continent and region. This extends the previous step's analysis for countries.



In [35]:
mt_hg_continent_distribution = df.groupby('continent')['mt_hg'].value_counts(normalize=True).unstack(fill_value=0)
print("\nMitochondrial Haplogroup (mt_hg) frequency distribution per Continent:")
display(mt_hg_continent_distribution.head())

mt_hg_region_distribution = df.groupby('region')['mt_hg'].value_counts(normalize=True).unstack(fill_value=0)
print("\nMitochondrial Haplogroup (mt_hg) frequency distribution per Region:")
display(mt_hg_region_distribution.head())


Mitochondrial Haplogroup (mt_hg) frequency distribution per Continent:


mt_hg,C1,C4a1a+195,G2a4,H,H or H5-C16192T,H+163,H+16311,H1,H1 or H1au1b,H1+16189,...,X2b,X2b+226,X2b4,X2c1,X2d1,X2d2,X2f,X2m2,X2n,no_data
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Asia,0.0,0.0,0.0,0.028571,0.014286,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.014286,0.014286,0.014286,0.014286,0.0
Europe,0.004202,0.004202,0.002101,0.031513,0.0,0.002101,0.002101,0.039916,0.002101,0.002101,...,0.002101,0.008403,0.002101,0.002101,0.002101,0.0,0.0,0.0,0.0,0.002101



Mitochondrial Haplogroup (mt_hg) frequency distribution per Region:


mt_hg,C1,C4a1a+195,G2a4,H,H or H5-C16192T,H+163,H+16311,H1,H1 or H1au1b,H1+16189,...,X2b,X2b+226,X2b4,X2c1,X2d1,X2d2,X2f,X2m2,X2n,no_data
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Anatolia,0.0,0.0,0.0,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.038462,0.0,0.038462,0.0,0.0
Baltic,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Caucasus,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0
Iberia,0.0,0.006689,0.0,0.026756,0.0,0.0,0.0,0.053512,0.0,0.003344,...,0.003344,0.010033,0.0,0.003344,0.0,0.0,0.0,0.0,0.0,0.003344
Middle East,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Reasoning**:
To complete the haplogroup distribution analysis, I will now calculate and display the frequency distribution of 'ychr_hg' for each unique country, continent, and region. This aligns with instruction 2 of the subtask.



In [36]:
ychr_hg_country_distribution = df.groupby('country')['ychr_hg'].value_counts(normalize=True).unstack(fill_value=0)
print("\nY-chromosome Haplogroup (ychr_hg) frequency distribution per Country:")
display(ychr_hg_country_distribution.head())

ychr_hg_continent_distribution = df.groupby('continent')['ychr_hg'].value_counts(normalize=True).unstack(fill_value=0)
print("\nY-chromosome Haplogroup (ychr_hg) frequency distribution per Continent:")
display(ychr_hg_continent_distribution.head())

ychr_hg_region_distribution = df.groupby('region')['ychr_hg'].value_counts(normalize=True).unstack(fill_value=0)
print("\nY-chromosome Haplogroup (ychr_hg) frequency distribution per Region:")
display(ychr_hg_region_distribution.head())


Y-chromosome Haplogroup (ychr_hg) frequency distribution per Country:


ychr_hg,-,BT,C1a2,CF,CT,"CT(xE,G,J,LT,R,Q1a,Q1b)",CT(xG),"CT(xG,xR)",E1b1,E1b1a1,...,"R1b1a1a2a1a2(xR1b1a1a2a1a2c,xR1b1a1a2a1a2a5)","R1b1a1a2a1a2(xR1b1a1a2a1a2c,xR1b1a1a2a1a2b1,xR1b1a1a2a1a2a5)",R1b1a1a2a1a2a1,R1b1a1a2a1a2a5,R1b1a1a2a1a2c,R1b1a1a2a2,"T(xT1a1,T1a2a)",T1a,Unknown,nodata
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Armenia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.555556,0.0
Germany,0.0,0.0,0.0,0.0,0.021978,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010989,0.67033,0.0
Gibraltar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0
Greece,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.727273,0.0
Hungary,0.0,0.0,0.068966,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.413793,0.0



Y-chromosome Haplogroup (ychr_hg) frequency distribution per Continent:


ychr_hg,-,BT,C1a2,CF,CT,"CT(xE,G,J,LT,R,Q1a,Q1b)",CT(xG),"CT(xG,xR)",E1b1,E1b1a1,...,"R1b1a1a2a1a2(xR1b1a1a2a1a2c,xR1b1a1a2a1a2a5)","R1b1a1a2a1a2(xR1b1a1a2a1a2c,xR1b1a1a2a1a2b1,xR1b1a1a2a1a2a5)",R1b1a1a2a1a2a1,R1b1a1a2a1a2a5,R1b1a1a2a1a2c,R1b1a1a2a2,"T(xT1a1,T1a2a)",T1a,Unknown,nodata
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Asia,0.014286,0.0,0.014286,0.0,0.042857,0.014286,0.0,0.0,0.014286,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.428571,0.014286
Europe,0.002101,0.006303,0.006303,0.002101,0.016807,0.0,0.002101,0.002101,0.0,0.002101,...,0.002101,0.002101,0.002101,0.002101,0.004202,0.014706,0.0,0.002101,0.476891,0.0



Y-chromosome Haplogroup (ychr_hg) frequency distribution per Region:


ychr_hg,-,BT,C1a2,CF,CT,"CT(xE,G,J,LT,R,Q1a,Q1b)",CT(xG),"CT(xG,xR)",E1b1,E1b1a1,...,"R1b1a1a2a1a2(xR1b1a1a2a1a2c,xR1b1a1a2a1a2a5)","R1b1a1a2a1a2(xR1b1a1a2a1a2c,xR1b1a1a2a1a2b1,xR1b1a1a2a1a2a5)",R1b1a1a2a1a2a1,R1b1a1a2a1a2a5,R1b1a1a2a1a2c,R1b1a1a2a2,"T(xT1a1,T1a2a)",T1a,Unknown,nodata
region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Anatolia,0.038462,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.384615,0.0
Baltic,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Caucasus,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.555556,0.0
Iberia,0.0,0.010033,0.003344,0.003344,0.020067,0.0,0.003344,0.003344,0.0,0.003344,...,0.003344,0.003344,0.003344,0.003344,0.006689,0.0,0.0,0.0,0.441472,0.0
Middle East,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0


**Reasoning**:
To fulfill the third instruction of the subtask, I will display the top 5 most frequent `mt_hg` and `ychr_hg` haplogroups for a few selected countries, continents, and regions. This will provide a concise initial understanding of the dominant haplogroups in specific geographical areas.



In [37]:
print("\nTop 5 Mitochondrial Haplogroups (mt_hg) for selected geographical categories:")
# Example for a Country (e.g., Germany)
if 'Germany' in mt_hg_country_distribution.index:
    print("\nGermany:")
    display(mt_hg_country_distribution.loc['Germany'].nlargest(5))

# Example for a Continent (e.g., Europe)
if 'Europe' in mt_hg_continent_distribution.index:
    print("\nEurope:")
    display(mt_hg_continent_distribution.loc['Europe'].nlargest(5))

# Example for a Region (e.g., Iberia)
if 'Iberia' in mt_hg_region_distribution.index:
    print("\nIberia:")
    display(mt_hg_region_distribution.loc['Iberia'].nlargest(5))


print("\nTop 5 Y-chromosome Haplogroups (ychr_hg) for selected geographical categories:")
# Example for a Country (e.g., Germany)
if 'Germany' in ychr_hg_country_distribution.index:
    print("\nGermany:")
    display(ychr_hg_country_distribution.loc['Germany'].nlargest(5))

# Example for a Continent (e.g., Europe)
if 'Europe' in ychr_hg_continent_distribution.index:
    print("\nEurope:")
    display(ychr_hg_continent_distribution.loc['Europe'].nlargest(5))

# Example for a Region (e.g., Iberia)
if 'Iberia' in ychr_hg_region_distribution.index:
    print("\nIberia:")
    display(ychr_hg_region_distribution.loc['Iberia'].nlargest(5))


Top 5 Mitochondrial Haplogroups (mt_hg) for selected geographical categories:

Germany:


Unnamed: 0_level_0,Germany
mt_hg,Unnamed: 1_level_1
T2b,0.087912
H1,0.032967
H3,0.032967
J1c5,0.032967
N1a1a1a3,0.032967



Europe:


Unnamed: 0_level_0,Europe
mt_hg,Unnamed: 1_level_1
H1,0.039916
H,0.031513
H3,0.029412
T2b,0.02521
J1c1,0.023109



Iberia:


Unnamed: 0_level_0,Iberia
mt_hg,Unnamed: 1_level_1
H1,0.053512
H3,0.036789
U5b1+16189+@16192,0.036789
J1c1,0.033445
H,0.026756



Top 5 Y-chromosome Haplogroups (ychr_hg) for selected geographical categories:

Germany:


Unnamed: 0_level_0,Germany
ychr_hg,Unnamed: 1_level_1
Unknown,0.67033
G2a2a,0.087912
R1a,0.043956
CT,0.021978
R1b1a1a2,0.021978



Europe:


Unnamed: 0_level_0,Europe
ychr_hg,Unnamed: 1_level_1
Unknown,0.476891
I,0.02521
R1b1a1a2,0.023109
G2a2a,0.021008
R1b1a1a2a1a2,0.021008



Iberia:


Unnamed: 0_level_0,Iberia
ychr_hg,Unnamed: 1_level_1
Unknown,0.441472
I,0.033445
R1b1a1a2a1a,0.0301
R1b1a1a2a1a2,0.0301
CT,0.020067


## Analyze haplogroup distribution by time periods

### Subtask:
Analyze the distribution and frequencies of mitochondrial haplogroups (mt_hg) and Y-chromosome haplogroups (ychr_hg) across different time periods (epoch).


**Reasoning**:
To analyze the haplogroup distribution by time periods, I will calculate and display the frequency distributions of 'mt_hg' and 'ychr_hg' for each unique 'epoch', as specified in instructions 1 through 6.



In [38]:
mt_hg_epoch_distribution = df.groupby('epoch')['mt_hg'].value_counts(normalize=True).unstack(fill_value=0)
print("Mitochondrial Haplogroup (mt_hg) frequency distribution per Epoch:")
display(mt_hg_epoch_distribution.head())

ychr_hg_epoch_distribution = df.groupby('epoch')['ychr_hg'].value_counts(normalize=True).unstack(fill_value=0)
print("\nY-chromosome Haplogroup (ychr_hg) frequency distribution per Epoch:")
display(ychr_hg_epoch_distribution.head())


Mitochondrial Haplogroup (mt_hg) frequency distribution per Epoch:


mt_hg,C1,C4a1a+195,G2a4,H,H or H5-C16192T,H+163,H+16311,H1,H1 or H1au1b,H1+16189,...,X2b,X2b+226,X2b4,X2c1,X2d1,X2d2,X2f,X2m2,X2n,no_data
epoch,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Bronze Age,0.00885,0.0,0.0,0.026549,0.0,0.00885,0.0,0.026549,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00885,0.0,0.00885,0.0
Classical Age,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Copper Age,0.0,0.0,0.0,0.038095,0.0,0.0,0.0,0.028571,0.0,0.0,...,0.0,0.009524,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Epipaleolithic,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Iron Age,0.0,0.0,0.04,0.08,0.0,0.0,0.0,0.16,0.0,0.0,...,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



Y-chromosome Haplogroup (ychr_hg) frequency distribution per Epoch:


ychr_hg,-,BT,C1a2,CF,CT,"CT(xE,G,J,LT,R,Q1a,Q1b)",CT(xG),"CT(xG,xR)",E1b1,E1b1a1,...,"R1b1a1a2a1a2(xR1b1a1a2a1a2c,xR1b1a1a2a1a2a5)","R1b1a1a2a1a2(xR1b1a1a2a1a2c,xR1b1a1a2a1a2b1,xR1b1a1a2a1a2a5)",R1b1a1a2a1a2a1,R1b1a1a2a1a2a5,R1b1a1a2a1a2c,R1b1a1a2a2,"T(xT1a1,T1a2a)",T1a,Unknown,nodata
epoch,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Bronze Age,0.00885,0.0,0.0,0.0,0.017699,0.0,0.0,0.0,0.0,0.0,...,0.00885,0.00885,0.00885,0.0,0.0,0.061947,0.0,0.0,0.460177,0.0
Classical Age,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.029412,0.0,0.0,0.0,0.441176,0.0
Copper Age,0.0,0.0,0.0,0.0,0.0,0.0,0.009524,0.009524,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.457143,0.0
Epipaleolithic,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.166667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0
Iron Age,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.44,0.0


**Reasoning**:
To complete the haplogroup distribution analysis by time periods, I will now display the top 5 most frequent `mt_hg` and `ychr_hg` haplogroups for a few selected epochs, as specified in instructions 7 through 10.



In [39]:
print("\nTop 5 Mitochondrial Haplogroups (mt_hg) for selected epochs:")
selected_epochs = ['Bronze Age', 'Copper Age', 'Iron Age'] # Example epochs

for epoch in selected_epochs:
    if epoch in mt_hg_epoch_distribution.index:
        print(f"\nEpoch: {epoch}")
        display(mt_hg_epoch_distribution.loc[epoch].nlargest(5))

print("\nTop 5 Y-chromosome Haplogroups (ychr_hg) for selected epochs:")
for epoch in selected_epochs:
    if epoch in ychr_hg_epoch_distribution.index:
        print(f"\nEpoch: {epoch}")
        display(ychr_hg_epoch_distribution.loc[epoch].nlargest(5))


Top 5 Mitochondrial Haplogroups (mt_hg) for selected epochs:

Epoch: Bronze Age


Unnamed: 0_level_0,Bronze Age
mt_hg,Unnamed: 1_level_1
H,0.026549
H1,0.026549
J1c1,0.026549
K1a3a,0.026549
U5a1,0.026549



Epoch: Copper Age


Unnamed: 0_level_0,Copper Age
mt_hg,Unnamed: 1_level_1
H3,0.047619
H,0.038095
J2a1a1,0.038095
K1a4a1,0.038095
H1,0.028571



Epoch: Iron Age


Unnamed: 0_level_0,Iron Age
mt_hg,Unnamed: 1_level_1
H1,0.16
H,0.08
H1t,0.08
J1c1,0.08
G2a4,0.04



Top 5 Y-chromosome Haplogroups (ychr_hg) for selected epochs:

Epoch: Bronze Age


Unnamed: 0_level_0,Bronze Age
ychr_hg,Unnamed: 1_level_1
Unknown,0.460177
R1b1a1a2a2,0.061947
R1b1a1a2,0.053097
R1b1a1a2a1a2(xR1b1a1a2a1a2c),0.035398
R1a1a1b2,0.026549



Epoch: Copper Age


Unnamed: 0_level_0,Copper Age
ychr_hg,Unnamed: 1_level_1
Unknown,0.457143
I,0.047619
I2a1a1a,0.038095
I2a1b,0.038095
G2a2b2a1a1c1a,0.028571



Epoch: Iron Age


Unnamed: 0_level_0,Iron Age
ychr_hg,Unnamed: 1_level_1
Unknown,0.44
R1b1a1a2a1a,0.12
R1b1a1a2a1a2,0.12
R1b1a1a,0.08
I2a1a1a,0.04


## Mitochondrial Disease Correlation Analysis

### Subtask:
Prepare the `individual_diseases` column for correlation analysis by expanding the list of diseases into separate rows.


**Reasoning**:
To prepare the `individual_diseases` column for correlation analysis, I will expand the list of diseases into separate rows using the `explode()` method and store the result in a new DataFrame.



In [40]:
df_expanded_diseases = df.explode('individual_diseases')

print("DataFrame with individual diseases expanded to separate rows:")
display(df_expanded_diseases.head())
print(f"Shape of the expanded DataFrame: {df_expanded_diseases.shape}")

DataFrame with individual diseases expanded to separate rows:


Unnamed: 0,identifier,alternative_identifiers,country,continent,region,culture,epoch,group,comment,latitude,...,reference_name,reference_link,data_link,c14_sample_tag,c14_layer_tag,avg_coverage,sequence_source,mitopatho_alleles,mitopatho_diseases,individual_diseases
0,I1496,Motala1,Hungary,Europe,Pannonia,Linear Pottery,Neolithic,LBK,No Comment,47.169998,...,Gamba et al. 2014,https://dx.doi.org/10.1038/ncomms6257,https://www.ncbi.nlm.nih.gov/sra?term=SRP039766,1,0,0.0,bam,12372A;16093C;11467G;16192T;12308G;5460A;16270T,Altered brain pH / sCJD patients;Cyclic Vomiti...,Altered brain pH / sCJD patients
0,I1496,Motala1,Hungary,Europe,Pannonia,Linear Pottery,Neolithic,LBK,No Comment,47.169998,...,Gamba et al. 2014,https://dx.doi.org/10.1038/ncomms6257,https://www.ncbi.nlm.nih.gov/sra?term=SRP039766,1,0,0.0,bam,12372A;16093C;11467G;16192T;12308G;5460A;16270T,Altered brain pH / sCJD patients;Cyclic Vomiti...,Cyclic Vomiting Syndrome
0,I1496,Motala1,Hungary,Europe,Pannonia,Linear Pottery,Neolithic,LBK,No Comment,47.169998,...,Gamba et al. 2014,https://dx.doi.org/10.1038/ncomms6257,https://www.ncbi.nlm.nih.gov/sra?term=SRP039766,1,0,0.0,bam,12372A;16093C;11467G;16192T;12308G;5460A;16270T,Altered brain pH / sCJD patients;Cyclic Vomiti...,Altered brain pH / sCJD patients
0,I1496,Motala1,Hungary,Europe,Pannonia,Linear Pottery,Neolithic,LBK,No Comment,47.169998,...,Gamba et al. 2014,https://dx.doi.org/10.1038/ncomms6257,https://www.ncbi.nlm.nih.gov/sra?term=SRP039766,1,0,0.0,bam,12372A;16093C;11467G;16192T;12308G;5460A;16270T,Altered brain pH / sCJD patients;Cyclic Vomiti...,Melanoma patients
0,I1496,Motala1,Hungary,Europe,Pannonia,Linear Pottery,Neolithic,LBK,No Comment,47.169998,...,Gamba et al. 2014,https://dx.doi.org/10.1038/ncomms6257,https://www.ncbi.nlm.nih.gov/sra?term=SRP039766,1,0,0.0,bam,12372A;16093C;11467G;16192T;12308G;5460A;16270T,Altered brain pH / sCJD patients;Cyclic Vomiti...,CPEO / Stroke / CM / Breast & Renal & Prostate...


Shape of the expanded DataFrame: (1091, 29)


## Identify unique diseases and their frequencies

### Subtask:
Identify all unique disease entries and calculate their frequencies within the `individual_diseases` column of the `df_expanded_diseases` DataFrame.


**Reasoning**:
To identify unique disease entries and their frequencies, I will use the `value_counts()` method on the `individual_diseases` column of the `df_expanded_diseases` DataFrame. This will directly provide the count for each unique disease, and I can then display the top 20.



In [41]:
disease_frequencies = df_expanded_diseases['individual_diseases'].value_counts()

print("Top 20 most frequent diseases and their counts:")
display(disease_frequencies.head(20))

Top 20 most frequent diseases and their counts:


Unnamed: 0_level_0,count
individual_diseases,Unnamed: 1_level_1
Altered brain pH / sCJD patients,96
Possible protective factor for normal tension glaucoma,85
Low VO2max response,51
CPEO / Stroke / CM / Breast & Renal & Prostate Cancer Risk / Altered brain pH /sCJD,48
PD protective factor / longevity / altered cell pH / metabolic syndrome / breast cancer risk / Leigh Syndrome risk / ADHD / cognitive decline / SCA2 age of onset / Fuchs endothelial corneal dystrophy,44
Melanoma patients,39
Reduced risk of PD,24
BD-associated / melanoma pts,22
MDD-associated,20
LHON / Insulin Resistance /possible adaptive high altitude variant / miscarriage,19


## Correlate diseases with geographical categories (Country)

### Subtask:
Analyze the association between individual diseases and geographical categories, starting with countries.


**Reasoning**:
To analyze the association between individual diseases and geographical categories, I will create a cross-tabulation (contingency table) of 'individual_diseases' against 'country' using `pd.crosstab()` on the `df_expanded_diseases` DataFrame, store the result in `disease_country_crosstab`, and display its head.



In [42]:
disease_country_crosstab = pd.crosstab(df_expanded_diseases['individual_diseases'], df_expanded_diseases['country'])

print("Cross-tabulation of individual diseases and country (first 5 rows and columns):")
display(disease_country_crosstab.head())

Cross-tabulation of individual diseases and country (first 5 rows and columns):


country,Armenia,Germany,Gibraltar,Greece,Hungary,Iran,Israel,Jordan,Russia,Spain,Sweden,Turkey
individual_diseases,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
AD / PD / LHON,0,5,0,0,1,1,1,0,1,3,0,1
Absence of Endometriosis,0,0,0,0,0,0,0,1,0,1,0,1
Altered brain pH / sCJD patients,4,20,0,10,6,4,0,6,4,24,0,18
Axial myopathy with encephalopathy,0,1,0,0,0,0,0,0,0,0,0,0
BD-associated,0,1,0,0,0,0,0,0,0,0,0,1


**Reasoning**:
To continue analyzing the association between individual diseases and geographical categories, I will now create a cross-tabulation of 'individual_diseases' against 'continent' using `pd.crosstab()` on the `df_expanded_diseases` DataFrame, and store the result in `disease_continent_crosstab`.



In [43]:
disease_continent_crosstab = pd.crosstab(df_expanded_diseases['individual_diseases'], df_expanded_diseases['continent'])

print("Cross-tabulation of individual diseases and continent (first 5 rows and columns):")
display(disease_continent_crosstab.head())

Cross-tabulation of individual diseases and continent (first 5 rows and columns):


continent,Asia,Europe
individual_diseases,Unnamed: 1_level_1,Unnamed: 2_level_1
AD / PD / LHON,3,10
Absence of Endometriosis,2,1
Altered brain pH / sCJD patients,32,64
Axial myopathy with encephalopathy,0,1
BD-associated,1,1


**Reasoning**:
To further analyze the association between individual diseases and geographical categories, I will now create a cross-tabulation of 'individual_diseases' against 'region' using `pd.crosstab()` on the `df_expanded_diseases` DataFrame, and store the result in `disease_region_crosstab`.



In [44]:
disease_region_crosstab = pd.crosstab(df_expanded_diseases['individual_diseases'], df_expanded_diseases['region'])

print("Cross-tabulation of individual diseases and region (first 5 rows and columns):")
display(disease_region_crosstab.head())

Cross-tabulation of individual diseases and region (first 5 rows and columns):


region,Anatolia,Caucasus,Iberia,Middle East,Near East,Pannonia,Pontic steppe,Scandinavia,central Europe,southern Europe
individual_diseases,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
AD / PD / LHON,1,0,3,1,1,1,1,0,5,0
Absence of Endometriosis,0,0,1,0,2,0,0,0,0,0
Altered brain pH / sCJD patients,14,4,24,4,10,6,4,0,20,10
Axial myopathy with encephalopathy,0,0,0,0,0,0,0,0,1,0
BD-associated,0,0,0,0,1,0,0,0,1,0


## Correlate diseases with time periods (epoch)

### Subtask:
Analyze the association between individual diseases and time periods (epoch).


**Reasoning**:
To analyze the association between individual diseases and time periods, I will create a cross-tabulation (contingency table) of 'individual_diseases' against 'epoch' using `pd.crosstab()` on the `df_expanded_diseases` DataFrame, and store the result in `disease_epoch_crosstab`.



In [45]:
disease_epoch_crosstab = pd.crosstab(df_expanded_diseases['individual_diseases'], df_expanded_diseases['epoch'])

print("Cross-tabulation of individual diseases and epoch (first 5 rows and columns):")
display(disease_epoch_crosstab.head())

Cross-tabulation of individual diseases and epoch (first 5 rows and columns):


epoch,Bronze Age,Copper Age,Mesolithic,Middle Ages,Neolithic
individual_diseases,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AD / PD / LHON,2,0,0,0,11
Absence of Endometriosis,0,1,0,1,1
Altered brain pH / sCJD patients,20,14,0,8,54
Axial myopathy with encephalopathy,0,0,0,0,1
BD-associated,1,0,0,0,1


## Correlate diseases with haplogroups

### Subtask:
Analyze the association between individual diseases and mitochondrial and Y-chromosome haplogroups.


**Reasoning**:
To analyze the association between individual diseases and mitochondrial haplogroups, I will create a cross-tabulation of 'individual_diseases' against 'mt_hg' using `pd.crosstab()` on the `df_expanded_diseases` DataFrame, store the result in `disease_mt_hg_crosstab`, and display its head.



In [46]:
disease_mt_hg_crosstab = pd.crosstab(df_expanded_diseases['individual_diseases'], df_expanded_diseases['mt_hg'])

print("Cross-tabulation of individual diseases and mitochondrial haplogroups (first 5 rows and columns):")
display(disease_mt_hg_crosstab.head())

Cross-tabulation of individual diseases and mitochondrial haplogroups (first 5 rows and columns):


mt_hg,H,H+163,H1,H13a1a2,H1e,H1e1a,H1e1c,H1j,H1j8,H1n+146,...,V1a,V3,V9,W1c'i,W1c1,W3a1a,W6a,X1'2'3,X2c1,X2d2
individual_diseases,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
AD / PD / LHON,0,0,0,0,1,0,1,0,0,0,...,0,0,0,0,0,1,0,1,0,0
Absence of Endometriosis,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Altered brain pH / sCJD patients,8,0,2,0,0,0,0,2,2,0,...,0,0,0,0,2,0,0,0,0,2
Axial myopathy with encephalopathy,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
BD-associated,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


**Reasoning**:
To complete the haplogroup association analysis, I will create a cross-tabulation of 'individual_diseases' against 'ychr_hg' using `pd.crosstab()` on the `df_expanded_diseases` DataFrame, store it in `disease_ychr_hg_crosstab`, and then display its head.



In [47]:
disease_ychr_hg_crosstab = pd.crosstab(df_expanded_diseases['individual_diseases'], df_expanded_diseases['ychr_hg'])

print("Cross-tabulation of individual diseases and Y-chromosome haplogroups (first 5 rows and columns):")
display(disease_ychr_hg_crosstab.head())

Cross-tabulation of individual diseases and Y-chromosome haplogroups (first 5 rows and columns):


ychr_hg,-,BT,C1a2,CT,E1b1b1,E1b1b1a,E1b1b1a1b1,F,G1a,G2,...,R1a1,R1b1a(xR1b1a1a2),R1b1a1a2,R1b1a1a2a,R1b1a1a2a1a,R1b1a1a2a1a(xR1b1a1a2a1a2a5),R1b1a1a2a1a2,R1b1a1a2a1a2(xR1b1a1a2a1a2c),"T(xT1a1,T1a2a)",Unknown
individual_diseases,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
AD / PD / LHON,0,0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,3
Absence of Endometriosis,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
Altered brain pH / sCJD patients,2,0,2,2,0,2,2,0,0,2,...,2,0,4,0,0,2,2,0,2,48
Axial myopathy with encephalopathy,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
BD-associated,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


## Identify key genetic markers for visualization

### Subtask:
Identify a few prominent mitochondrial and Y-chromosome haplogroups to use for geographic visualization.


**Reasoning**:
To identify the most prominent haplogroups, I will calculate and display the top 10 most frequent mitochondrial (mt_hg) and Y-chromosome (ychr_hg) haplogroups using `value_counts()` on the `df` DataFrame.



In [48]:
print("Top 10 most frequent Mitochondrial Haplogroups (mt_hg):")
display(df['mt_hg'].value_counts().head(10))

print("\nTop 10 most frequent Y-chromosome Haplogroups (ychr_hg):")
display(df['ychr_hg'].value_counts().head(10))

Top 10 most frequent Mitochondrial Haplogroups (mt_hg):


Unnamed: 0_level_0,count
mt_hg,Unnamed: 1_level_1
H1,19
H,17
T2b,14
H3,14
U5b1+16189+@16192,11
J1c1,11
K1a+195,9
HV0+195,8
N1a1a1,7
U5b1,7



Top 10 most frequent Y-chromosome Haplogroups (ychr_hg):


Unnamed: 0_level_0,count
ychr_hg,Unnamed: 1_level_1
Unknown,257
I,13
CT,11
R1b1a1a2,11
G2a2a,10
R1b1a1a2a1a2,10
R1b1a1a2a1a,9
J,8
H2,7
R1b1a1a2a2,7


## Geographic Visualization of Genetic Markers

### Subtask:
Visualize the geographic distribution of key mitochondrial and Y-chromosome haplogroups on an interactive map.


**Reasoning**:
To visualize the geographic distribution of key haplogroups, I need to import the `folium` library, define the prominent haplogroups, filter the DataFrame, create an interactive map, and add markers for each relevant data point.



In [49]:
import folium

# 2. Define prominent mitochondrial and Y-chromosome haplogroups
prominent_mt_hg = ['H', 'U5b1', 'T2b']
prominent_ychr_hg = ['R1b1a1a2', 'I2a1a1', 'CT']

# 3. Filter the df DataFrame
filtered_df = df[(df['mt_hg'].isin(prominent_mt_hg)) | (df['ychr_hg'].isin(prominent_ychr_hg))].copy()

# Handle potential missing latitude/longitude after filtering if any (though previous steps indicate none)
filtered_df.dropna(subset=['latitude', 'longitude'], inplace=True)

# 4. Create a Folium map object, centered at an appropriate global coordinate
m = folium.Map(location=[0, 0], zoom_start=2)

# 5. Iterate through the filtered DataFrame and add markers
for index, row in filtered_df.iterrows():
    popup_html = f"""
    <b>Identifier:</b> {row['identifier']}<br>
    <b>Country:</b> {row['country']}<br>
    <b>Mitochondrial HG:</b> {row['mt_hg']}<br>
    <b>Y-chromosome HG:</b> {row['ychr_hg']}
    """

    # Use different colors/icons for mt_hg and ychr_hg for better distinction
    # Prioritize mt_hg color if both are present and prominent
    if row['mt_hg'] in prominent_mt_hg:
        icon_color = 'blue' # Color for Mitochondrial Haplogroups
        icon_prefix = 'fa'
        icon_name = 'dna'
    elif row['ychr_hg'] in prominent_ychr_hg:
        icon_color = 'red' # Color for Y-chromosome Haplogroups
        icon_prefix = 'fa'
        icon_name = 'male'
    else: # Should not happen with current filtering, but as a fallback
        icon_color = 'gray'
        icon_prefix = 'fa'
        icon_name = 'question'

    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=folium.Popup(popup_html, max_width=300),
        icon=folium.Icon(color=icon_color, icon_color='white', icon=icon_name, prefix=icon_prefix)
    ).add_to(m)

# 6. Display the created Folium map
display(m)

print("Interactive map showing geographic distribution of prominent haplogroups displayed above.")

Interactive map showing geographic distribution of prominent haplogroups displayed above.


## Predictive Modeling for Genetic Traits

### Subtask:
Develop a machine learning model to predict mitochondrial haplogroups (mt_hg) based on geographical and temporal features.


**Reasoning**:
First, I will prepare the dataset for modeling by selecting the specified features and filtering out 'Unknown' mitochondrial haplogroups as per instruction 1a and 1b.



In [50]:
df_model = df[['country', 'continent', 'epoch', 'sex', 'latitude', 'longitude', 'mt_hg']].copy()
df_model = df_model[df_model['mt_hg'] != 'Unknown']

print("DataFrame 'df_model' prepared with selected features and filtered 'mt_hg'.")
print(f"Shape of df_model: {df_model.shape}")
print("First 5 rows of df_model:")
display(df_model.head())

DataFrame 'df_model' prepared with selected features and filtered 'mt_hg'.
Shape of df_model: (544, 7)
First 5 rows of df_model:


Unnamed: 0,country,continent,epoch,sex,latitude,longitude,mt_hg
0,Hungary,Europe,Neolithic,M,47.169998,19.83,K1a3a3
1,Jordan,Asia,Neolithic,M,31.99,35.98,R0a
2,Turkey,Asia,Neolithic,M,40.299999,29.57,N1b1a
3,Jordan,Asia,Neolithic,M,31.99,35.98,T1a2
4,Turkey,Asia,Neolithic,M,40.299999,29.57,K1a or K1a1


**Reasoning**:
To continue preparing the dataset for modeling, I will encode the target variable `mt_hg` using `LabelEncoder` and then one-hot encode the categorical features (`country`, `continent`, `epoch`, `sex`) using `pd.get_dummies()`. Finally, I will combine these encoded features with the numerical features (`latitude`, `longitude`) to form the feature matrix `X` and define the target vector `y` as `mt_hg_encoded`.



In [51]:
from sklearn.preprocessing import LabelEncoder

# 2a. Initialize a LabelEncoder for the target variable mt_hg
label_encoder = LabelEncoder()
df_model['mt_hg_encoded'] = label_encoder.fit_transform(df_model['mt_hg'])

# 2b. Identify the categorical feature columns
categorical_features = ['country', 'continent', 'epoch', 'sex']

# 2c. Use pd.get_dummies() to one-hot encode these categorical features
X_categorical = pd.get_dummies(df_model[categorical_features], drop_first=True)

# 2d. Combine the one-hot encoded features with the numerical features
X_numerical = df_model[['latitude', 'longitude']]
X = pd.concat([X_numerical, X_categorical], axis=1)

# 2e. Define the target vector y
y = df_model['mt_hg_encoded']

print("Categorical features one-hot encoded and combined with numerical features.")
print(f"Shape of feature matrix X: {X.shape}")
print(f"Shape of target vector y: {y.shape}")
print("First 5 rows of feature matrix X:")
display(X.head())
print("First 5 entries of target vector y:")
display(y.head())

Categorical features one-hot encoded and combined with numerical features.
Shape of feature matrix X: (544, 24)
Shape of target vector y: (544,)
First 5 rows of feature matrix X:


Unnamed: 0,latitude,longitude,country_Germany,country_Gibraltar,country_Greece,country_Hungary,country_Iran,country_Israel,country_Jordan,country_Portugal,...,continent_Europe,epoch_Classical Age,epoch_Copper Age,epoch_Epipaleolithic,epoch_Iron Age,epoch_Mesolithic,epoch_Middle Ages,epoch_Neolithic,sex_M,sex_U
0,47.169998,19.83,False,False,False,True,False,False,False,False,...,True,False,False,False,False,False,False,True,True,False
1,31.99,35.98,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,True,True,False
2,40.299999,29.57,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
3,31.99,35.98,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,True,True,False
4,40.299999,29.57,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False


First 5 entries of target vector y:


Unnamed: 0,mt_hg_encoded
0,131
1,158
2,156
3,166
4,114


**Reasoning**:
To prepare the data for model training and evaluation, I will split the feature matrix `X` and target vector `y` into training and testing sets using `train_test_split`, ensuring reproducibility with a `random_state`.



In [52]:
from sklearn.model_selection import train_test_split

# 3. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Data split into training and testing sets successfully.")
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Data split into training and testing sets successfully.
Shape of X_train: (435, 24)
Shape of X_test: (109, 24)
Shape of y_train: (435,)
Shape of y_test: (109,)


**Reasoning**:
Now that the data is split into training and testing sets, I will train a RandomForestClassifier model using the training data, as specified in instruction 4.



In [53]:
from sklearn.ensemble import RandomForestClassifier

# 4b. Instantiate a RandomForestClassifier
model = RandomForestClassifier(random_state=42)

# 4c. Train the model
model.fit(X_train, y_train)

print("RandomForestClassifier model trained successfully.")

RandomForestClassifier model trained successfully.


**Reasoning**:
To evaluate the trained model, I will make predictions on the test set and then calculate and print the accuracy, precision, recall, and F1-score as specified in instruction 5.



In [54]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 5a. Make predictions on the test set
y_pred = model.predict(X_test)

# 5c. Calculate and print the overall accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")

# 5d. Calculate and print the precision, recall, and F1-score
precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)

print(f"Weighted Precision: {precision:.4f}")
print(f"Weighted Recall: {recall:.4f}")
print(f"Weighted F1-score: {f1:.4f}")



Model Accuracy: 0.0367
Weighted Precision: 0.0260
Weighted Recall: 0.0367
Weighted F1-score: 0.0291


## Final Task

### Subtask:
Provide a comprehensive summary of the key findings from the genetic population analysis, mitochondrial disease correlation, geographic visualizations, and the performance of the predictive model. Highlight any significant patterns, correlations, or insights discovered from the advanced DNA and data science techniques applied.


## Summary:

### Data Analysis Key Findings

*   **Data Cleaning Effectiveness**: Missing values in haplogroup columns (`mt_hg`, `ychr_hg`) were filled with 'Unknown', and disease-related columns (`mitopatho_alleles`, `mitopatho_diseases`) with 'None'. Eight columns with over 70% missing data (ranging from 74.18% to 94.14%) were removed, reducing the DataFrame from its original size to (546, 29). Geographical coordinates (`latitude`, `longitude`) were found to be complete and correctly typed.
*   **Disease Prevalence**: Upon parsing the `mitopatho_diseases` column, "Altered brain pH / sCJD patients" was identified as the most frequent disease with 96 occurrences, followed by "Possible protective factor for normal tension glaucoma" (85 occurrences), and "Low VO2max response" (51 occurrences).
*   **Haplogroup Distribution by Geography**:
    *   Mitochondrial haplogroups (`mt_hg`) showed distinct geographical patterns, such as `T2b` being most frequent in Germany (8.79%), while `H1` predominated in Europe (3.99%) and Iberia (5.35%).
    *   Y-chromosome haplogroups (`ychr_hg`) had a significant 'Unknown' category across all geographical regions, accounting for 67% in Germany, 47.7% in Europe, and 44.1% in Iberia. Among identified `ychr_hg`, `G2a2a` was most frequent in Germany (8.79%), and `I` was prominent in both Europe (2.52%) and Iberia (3.34%).
*   **Haplogroup Distribution by Epoch**: Similar to geographical distribution, 'Unknown' was highly prevalent in `ychr_hg` across historical epochs (e.g., 46.02% in Bronze Age, 45.71% in Copper Age, 44% in Iron Age). `mt_hg` also showed temporal variations, with `H1` being notably frequent in the Iron Age (16%).
*   **Disease-Haplogroup and Disease-Temporal Correlations**: Cross-tabulations revealed associations between specific diseases and certain haplogroups (e.g., "AD / PD / LHON" with `mt_hg` types like `H1e`, `W3a1a`, and `ychr_hg` types like `C1a2`, `R1b1a1a2a`) and epochs (e.g., "Altered brain pH / sCJD patients" appearing 54 times in Neolithic samples).
*   **Predictive Model Performance**: A RandomForestClassifier model, designed to predict `mt_hg` based on geographical and temporal features, exhibited very low performance metrics: an accuracy of 0.0367, weighted precision of 0.0260, weighted recall of 0.0367, and an F1-score of 0.0291. This indicates the chosen features or model are not effective for predicting mitochondrial haplogroups in this dataset.

### Insights or Next Steps

*   **Improve `ychr_hg` Data Quality**: The high percentage of 'Unknown' Y-chromosome haplogroups significantly hinders genetic population analysis. Future data collection or re-analysis should prioritize obtaining more granular and complete `ychr_hg` data.
*   **Refine Predictive Modeling**: The current predictive model's poor performance suggests that either the selected features are not sufficiently correlated with `mt_hg`, or a more complex model/feature engineering approach is required. Exploring other genetic markers, environmental factors, or advanced ensemble methods could improve prediction accuracy.


##**THANK YOU FOR YOUR INTEREST IN MY PROJECT AND MY LEARNING JOURNEY**