# Spinal Muscular Atrophy (SMA) – Background and Data Overview


## Day 1

## Biological Background

Spinal Muscular Atrophy (SMA) is a rare autosomal recessive genetic disorder caused by deletions or mutations in the SMN1 gene.
The resulting shortage of SMN protein leads to degeneration of motor neurons, which causes progressive muscle weakness.

The number of SMN2 gene copies influences disease severity.
SMN2 produces a small amount of functional SMN protein, so patients with more SMN2 copies usually have milder symptoms.


## Dataset Description

This project uses a clinical-style dataset related to SMA patients.
The dataset includes patient age, disease type, genetic features,
and motor function scores.

Target variable: disease severity.


## Planned Data Columns

- age: Patient age in years
- sma_type: Type of SMA (I, II, III)
- smn2_copies: Number of SMN2 gene copies
- motor_score: Motor function assessment score
- severity: Disease severity level
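
The planned columns above can be turned into a small schema check that runs right after loading. The following is a minimal sketch, not part of the original notebook: the names `EXPECTED_KINDS` and `schema_problems` are hypothetical, and the two sample rows are copied from the data preview in this notebook.

```python
import pandas as pd
from pandas.api import types as ptypes

# Planned columns mapped to the kind of data each should hold (assumed kinds).
EXPECTED_KINDS = {
    "age": "integer",
    "sma_type": "text",
    "smn2_copies": "integer",
    "motor_score": "integer",
    "severity": "text",
}


def schema_problems(df: pd.DataFrame) -> list:
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    for column, kind in EXPECTED_KINDS.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif kind == "integer" and not ptypes.is_integer_dtype(df[column]):
            problems.append(f"{column} is not an integer column")
        elif kind == "text" and not ptypes.is_string_dtype(df[column]):
            problems.append(f"{column} is not a text column")
    return problems


# Two rows copied from the data preview in this notebook.
sample = pd.DataFrame({
    "age": [1, 10],
    "sma_type": ["Type I", "Type III"],
    "smn2_copies": [2, 4],
    "motor_score": [10, 65],
    "severity": ["severe", "mild"],
})

print(schema_problems(sample))
print(schema_problems(sample.drop(columns="severity")))
```

An empty list means the frame matches the plan; any entry names the column that needs attention.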


## Day 2

In [1]:
## Importing libraries
import pandas as pd

In [2]:
## Loading the dataset
df = pd.read_csv("D:/My Projects/sma-data-science/data/raw/sma_clinical_data.csv")
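
The absolute `D:/...` path above only resolves on one machine. One possible alternative, sketched here with `pathlib`, builds the paths relative to a project root. `PROJECT_ROOT` is a hypothetical placeholder and would need to point at wherever the repository actually lives.

```python
from pathlib import Path

# Hypothetical project root; adjust to the real location of the repository.
PROJECT_ROOT = Path("sma-data-science")

# Paths mirror the folder layout used in this notebook.
RAW_CSV = PROJECT_ROOT / "data" / "raw" / "sma_clinical_data.csv"
PROCESSED_CSV = PROJECT_ROOT / "data" / "processed" / "sma_clinical_data_cleaned.csv"

print(RAW_CSV.as_posix())
print(PROCESSED_CSV.as_posix())

# pandas accepts Path objects directly, e.g.:
# df = pd.read_csv(RAW_CSV)
```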

In [7]:
## Basic Info
df.shape
df.columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   age          10 non-null     int64 
 1   sma_type     10 non-null     object
 2   smn2_copies  10 non-null     int64 
 3   motor_score  10 non-null     int64 
 4   severity     10 non-null     object
dtypes: int64(3), object(2)
memory usage: 532.0+ bytes
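
A notebook cell echoes only the value of its last expression, which is why the output above contains the `df.info()` report but nothing for `df.shape` or `df.columns`. A minimal sketch with explicit `print` calls follows; to keep it self-contained, the frame is rebuilt from the ten rows shown in the `head()` and `tail()` previews of this notebook rather than read from disk.

```python
import pandas as pd

# The ten rows shown in the head() and tail() previews of this notebook.
df = pd.DataFrame({
    "age": [1, 2, 5, 7, 10, 12, 4, 9, 3, 6],
    "sma_type": ["Type I", "Type I", "Type II", "Type II", "Type III",
                 "Type III", "Type II", "Type III", "Type I", "Type II"],
    "smn2_copies": [2, 2, 3, 3, 4, 4, 3, 4, 2, 3],
    "motor_score": [10, 12, 35, 40, 65, 70, 38, 68, 15, 42],
    "severity": ["severe", "severe", "moderate", "moderate", "mild",
                 "mild", "moderate", "mild", "severe", "moderate"],
})

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names as a plain list
df.info()                # info() prints its own report
```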


In [9]:
## Preview Data
df.head()

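|   | age | sma_type | smn2_copies | motor_score | severity |
|---|-----|----------|-------------|-------------|----------|
| 0 | 1 | Type I | 2 | 10 | severe |
| 1 | 2 | Type I | 2 | 12 | severe |
| 2 | 5 | Type II | 3 | 35 | moderate |
| 3 | 7 | Type II | 3 | 40 | moderate |
| 4 | 10 | Type III | 4 | 65 | mild |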


In [10]:
df.tail()

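|   | age | sma_type | smn2_copies | motor_score | severity |
|---|-----|----------|-------------|-------------|----------|
| 5 | 12 | Type III | 4 | 70 | mild |
| 6 | 4 | Type II | 3 | 38 | moderate |
| 7 | 9 | Type III | 4 | 68 | mild |
| 8 | 3 | Type I | 2 | 15 | severe |
| 9 | 6 | Type II | 3 | 42 | moderate |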


In [11]:
## Basic Statistics
df.describe()

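|       | age | smn2_copies | motor_score |
|-------|-----|-------------|-------------|
| count | 10.0 | 10.0 | 10.0 |
| mean | 5.9 | 3.0 | 39.5 |
| std | 3.60401 | 0.816497 | 22.726636 |
| min | 1.0 | 2.0 | 10.0 |
| 25% | 3.25 | 2.25 | 20.0 |
| 50% | 5.5 | 3.0 | 39.0 |
| 75% | 8.5 | 3.75 | 59.25 |
| max | 12.0 | 4.0 | 70.0 |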
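
As a cross-check on the summary statistics, the `age` column can be recomputed with the standard library alone. This sketch assumes the ten ages listed in the `head()` and `tail()` previews. It uses the sample standard deviation (n - 1) and linearly interpolated quartiles, which correspond to the defaults of `describe()`.

```python
import statistics

# Ages of the ten patients, copied from the head() and tail() previews.
ages = [1, 2, 5, 7, 10, 12, 4, 9, 3, 6]

mean_age = statistics.mean(ages)
std_age = statistics.stdev(ages)  # sample standard deviation (divides by n - 1)

# "inclusive" interpolates linearly between the observed minimum and maximum.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")

print("mean:", mean_age)
print("std:", round(std_age, 5))
print("quartiles:", q1, median, q3)
```

If the printed values disagree with the `describe()` table, the usual causes are a different degrees-of-freedom setting or a different quantile interpolation method.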


## Initial Observations

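- The dataset contains 10 rows and 5 columns covering clinical and genetic features of SMA patients.
- Three columns are numerical (age, smn2_copies, motor_score) and two are categorical (sma_type, severity).
- The target variable is disease severity.
- Every column has 10 non-null entries, so no missing values are observed at this stage.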
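
The biological background says that more SMN2 copies usually go with milder disease. A quick way to see whether this small dataset follows that pattern is a cross-tabulation. The sketch below is an illustration rather than part of the original analysis. It uses only the `smn2_copies` and `severity` values from the ten previewed rows.

```python
import pandas as pd

# smn2_copies and severity for the ten rows shown in the previews.
df = pd.DataFrame({
    "smn2_copies": [2, 2, 3, 3, 4, 4, 3, 4, 2, 3],
    "severity": ["severe", "severe", "moderate", "moderate", "mild",
                 "mild", "moderate", "mild", "severe", "moderate"],
})

# Rows: SMN2 copy number; columns: severity label; cells: patient counts.
table = pd.crosstab(df["smn2_copies"], df["severity"])
print(table)
```

Each copy number maps to a single severity label in these ten rows, so a model could predict the target from `smn2_copies` alone on this sample.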


## Day 3

In [3]:
## Checking for missing values
df.isnull().sum()

age            0
sma_type       0
smn2_copies    0
motor_score    0
severity       0
dtype: int64

## Missing Value Check

All five columns report zero missing values, so no imputation or row removal is needed.
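
For comparison, here is what the same check reports when values are missing. The frame `demo` below is hypothetical and contains deliberately inserted gaps; it is not taken from the SMA data.

```python
import pandas as pd

# Hypothetical frame with deliberate gaps, for illustration only.
demo = pd.DataFrame({
    "age": [1, 2, None],
    "severity": ["severe", None, None],
})

missing = demo.isnull().sum()                    # count of missing cells per column
percent = demo.isnull().mean().mul(100).round(1)  # share of missing cells per column

report = pd.DataFrame({"missing": missing, "percent": percent})
print(report)
```

Reporting the percentage alongside the count makes it easier to judge whether a column can be imputed or should be dropped.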


In [4]:
## Validate Data Types
df.dtypes

age             int64
sma_type       object
smn2_copies     int64
motor_score     int64
severity       object
dtype: object

## Data Type Validation

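The three numerical columns (age, smn2_copies, motor_score) are stored as int64, and the two categorical columns (sma_type, severity) are stored as object (string) columns.
No type conversion is required at this stage.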
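
Although no conversion is needed yet, later steps may benefit from telling pandas that severity has a natural order. The following is an optional sketch, assuming the order mild < moderate < severe. The variable names are illustrative and the short series stands in for the real column.

```python
import pandas as pd

# A few severity labels standing in for the real column.
severity = pd.Series(["severe", "moderate", "mild", "moderate"])

# Assumed clinical ordering, from least to most severe.
order = ["mild", "moderate", "severe"]
severity_cat = pd.Series(pd.Categorical(severity, categories=order, ordered=True))

print(severity_cat.sort_values().tolist())  # sorted by severity, not alphabetically
print(severity_cat.cat.codes.tolist())      # integer codes following the order
```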


In [6]:
## Check unique values
df["sma_type"].unique()


array(['Type I', 'Type II', 'Type III'], dtype=object)

In [7]:
df["severity"].unique()


array(['severe', 'moderate', 'mild'], dtype=object)

## Category Validation

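Each column contains exactly three distinct labels ('Type I', 'Type II', 'Type III' for sma_type and 'severe', 'moderate', 'mild' for severity), with no misspellings or inconsistent capitalization.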
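
The visual check of `unique()` can also be automated by comparing each column against a set of allowed labels. This is a minimal sketch: `ALLOWED` and `unexpected_labels` are hypothetical names, and `demo` is a made-up frame with two deliberately malformed labels.

```python
import pandas as pd

# Labels observed in this dataset, treated here as the allowed vocabulary.
ALLOWED = {
    "sma_type": {"Type I", "Type II", "Type III"},
    "severity": {"mild", "moderate", "severe"},
}


def unexpected_labels(df: pd.DataFrame) -> dict:
    """Map each checked column to the sorted labels outside its allowed set."""
    found = {}
    for column, allowed in ALLOWED.items():
        extra = set(df[column].dropna().unique()) - allowed
        if extra:
            found[column] = sorted(extra)
    return found


# Made-up frame with two deliberately malformed labels.
demo = pd.DataFrame({
    "sma_type": ["Type I", "type ii", "Type III"],
    "severity": ["severe", "moderate", "Mild"],
})

print(unexpected_labels(demo))
```

An empty dictionary means every label is in the allowed vocabulary.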

In [8]:
## Saving the cleaned dataset
df.to_csv("D:/My Projects/sma-data-science/data/processed/sma_clinical_data_cleaned.csv", index=False)
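
After saving, it can be worth confirming that the file reads back unchanged. The sketch below writes a three-row stand-in frame to a temporary directory instead of the project folder, then reloads it. `index=False` matters here: without it, the row index is written as an extra unnamed column that shows up as `Unnamed: 0` on reload.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Three rows copied from the previews, standing in for the full dataset.
df = pd.DataFrame({
    "age": [1, 5, 10],
    "sma_type": ["Type I", "Type II", "Type III"],
    "smn2_copies": [2, 3, 4],
    "motor_score": [10, 35, 65],
    "severity": ["severe", "moderate", "mild"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "sma_clinical_data_cleaned.csv"
    df.to_csv(out_path, index=False)  # do not write the row index
    reloaded = pd.read_csv(out_path)

# True when values, column names, and dtypes all survive the round trip.
round_trip_ok = reloaded.equals(df)
print(round_trip_ok)
```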