# Data Analysis Project - Protein Features

This notebook performs:
1. Data Loading and Preprocessing
2. Exploratory Data Analysis (EDA)
3. Data Imbalance Handling

## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Load Data and Initial Exploration

In [2]:
# Load the CSV file
df = pd.read_csv('protein_features.csv')

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()

Dataset shape: (140271, 46)

First few rows:


Unnamed: 0,ID,Recommended Name,Submitted Names,Gene name,Taxonomic lineage,Cellular components:,Domain:,Family:,Biological process:,Function:,...,Group_Negative_Hydrophobic,Group_Negative_Polar,Group_Negative_Positive,Group_Negative_Negative,Group_Negative_Special,Group_Special_Hydrophobic,Group_Special_Polar,Group_Special_Positive,Group_Special_Negative,Group_Special_Special
0,A0A6G0UGL8_9BILA,receptor protein-tyrosine kinase,,,Eukaryota | Metazoa | Ecdysozoa | Nematoda | C...,Cell membrane | Membrane,Signal | Transmembrane | Transmembrane helix,Belongs to the protein kinase superfamily. Tyr...,Cell adhesion,Kinase | Receptor | Transferase | Tyrosine-pro...,...,0.048984,0.0454,0.010753,0.021505,0.013142,0.023895,0.017921,0.013142,0.005974,0.0
1,A0A2K6SKD5_SAIBB,Tyrosine-protein kinase receptor,,LTK,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Cell membrane | Membrane,Signal | Transmembrane | Transmembrane helix,Belongs to the protein kinase superfamily. Tyr...,,Kinase | Receptor | Transferase | Tyrosine-pro...,...,0.047919,0.018916,0.006305,0.008827,0.01261,0.050441,0.018916,0.011349,0.013871,0.022699
2,A0A836ABD7_SHEEP,"Solute carrier family 2, facilitated glucose t...",,,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Cell membrane | Membrane,Transmembrane | Transmembrane helix,Belongs to the major facilitator superfamily. ...,,GTPase activation,...,0.042553,0.022744,0.007337,0.012472,0.00807,0.036684,0.017608,0.010271,0.005869,0.006603
3,A0A6P3RGQ7_PTEVA,Phosphatidylinositol 5-phosphate 4-kinase type...,,KIF5A,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Cytoplasm | Endoplasmic reticulum,,,Lipid metabolism,Kinase | Transferase,...,0.052257,0.028504,0.014252,0.023753,0.016627,0.04038,0.016627,0.002375,0.009501,0.002375
4,A0A3B3DRK3_ORYME,Tyrosine-protein kinase receptor,,,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Cell membrane | Membrane,Immunoglobulin domain | Leucine-rich repeat | ...,Belongs to the protein kinase superfamily. Tyr...,Differentiation | Neurogenesis,Developmental protein | Kinase | Receptor | Tr...,...,0.05686,0.02843,0.004944,0.009889,0.008653,0.027194,0.017305,0.012361,0.009889,0.001236


In [3]:
# Check data info
print("Dataset Information:")
print(f"Shape: {df.shape}")
print("\nColumn names:")
print(df.columns.tolist())
print("\nData types:")
df.dtypes

Dataset Information:
Shape: (140271, 46)

Column names:
['ID', 'Recommended Name', 'Submitted Names', 'Gene name', 'Taxonomic lineage', 'Cellular components:', 'Domain:', 'Family:', 'Biological process:', 'Function:', 'Sequence:', 'Length:', 'Molecular weight:', 'Number of interactors:', 'pI', 'Net_Charge_7_4', 'Hydrophobicity_GRAVY', 'Depth_Rank', 'Last_Rank', 'First_Rank', 'Sequence_Chunks', 'Group_Hydrophobic_Hydrophobic', 'Group_Hydrophobic_Polar', 'Group_Hydrophobic_Positive', 'Group_Hydrophobic_Negative', 'Group_Hydrophobic_Special', 'Group_Polar_Hydrophobic', 'Group_Polar_Polar', 'Group_Polar_Positive', 'Group_Polar_Negative', 'Group_Polar_Special', 'Group_Positive_Hydrophobic', 'Group_Positive_Polar', 'Group_Positive_Positive', 'Group_Positive_Negative', 'Group_Positive_Special', 'Group_Negative_Hydrophobic', 'Group_Negative_Polar', 'Group_Negative_Positive', 'Group_Negative_Negative', 'Group_Negative_Special', 'Group_Special_Hydrophobic', 'Group_Special_Polar', 'Group_Special_

ID                                object
Recommended Name                  object
Submitted Names                   object
Gene name                         object
Taxonomic lineage                 object
Cellular components:              object
Domain:                           object
Family:                           object
Biological process:               object
Function:                         object
Sequence:                         object
Length:                            int64
Molecular weight:                  int64
Number of interactors:             int64
pI                               float64
Net_Charge_7_4                   float64
Hydrophobicity_GRAVY             float64
Depth_Rank                         int64
Last_Rank                         object
First_Rank                        object
Sequence_Chunks                   object
Group_Hydrophobic_Hydrophobic    float64
Group_Hydrophobic_Polar          float64
Group_Hydrophobic_Positive       float64
Group_Hydrophobi

In [4]:
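# Keep the annotation and physicochemical columns; this drops the 25 Group_* features.
# .copy() gives an independent frame for the in-place insert/drop calls that follow.
keep_cols = [
    "ID", "Recommended Name", "Submitted Names", "Gene name", "Taxonomic lineage",
    "Cellular components:", "Domain:", "Family:", "Biological process:", "Function:",
    "Sequence:", "Length:", "Molecular weight:", "Number of interactors:",
    "pI", "Net_Charge_7_4", "Hydrophobicity_GRAVY",
    "Depth_Rank", "Last_Rank", "First_Rank", "Sequence_Chunks",
]
df = df[keep_cols].copy()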

In [5]:
# Use "Recommended Name" where present; otherwise fall back to "Submitted Names".
name = df["Recommended Name"].fillna(df["Submitted Names"])
df.insert(1, "Name", name)
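
The fallback in the cell above relies on `Series.fillna` accepting another Series and aligning it by index. A minimal sketch on a toy frame (the three rows are hypothetical, not taken from the dataset) shows the behavior, including the case where both name columns are empty:

```python
import pandas as pd

# Toy frame with hypothetical values mirroring the two name columns
toy = pd.DataFrame({
    "Recommended Name": ["Insulin", None, None],
    "Submitted Names": [None, "ISL LIM homeobox 2b", None],
})

# Take "Recommended Name" where present, otherwise "Submitted Names"
name = toy["Recommended Name"].fillna(toy["Submitted Names"])

# Rows where both sources are missing remain missing after the fill
n_unnamed = int(name.isna().sum())
print(name.tolist())
print(f"Rows still without a name: {n_unnamed}")
```

Running the same `isna().sum()` check on the real `Name` column would show how many proteins have neither a recommended nor a submitted name.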

In [6]:
df

Unnamed: 0,ID,Name,Recommended Name,Submitted Names,Gene name,Taxonomic lineage,Cellular components:,Domain:,Family:,Biological process:,...,Length:,Molecular weight:,Number of interactors:,pI,Net_Charge_7_4,Hydrophobicity_GRAVY,Depth_Rank,Last_Rank,First_Rank,Sequence_Chunks
0,A0A6G0UGL8_9BILA,receptor protein-tyrosine kinase,receptor protein-tyrosine kinase,,,Eukaryota | Metazoa | Ecdysozoa | Nematoda | C...,Cell membrane | Membrane,Signal | Transmembrane | Transmembrane helix,Belongs to the protein kinase superfamily. Tyr...,Cell adhesion,...,838,95462,0,5.518918,-17.586081,-0.435322,11,Halicephalobus,Eukaryota,"['MTDRVLDRQC', 'NKALGMESGR', 'IKDSQISASS', 'SF..."
1,A0A2K6SKD5_SAIBB,Tyrosine-protein kinase receptor,Tyrosine-protein kinase receptor,,LTK,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Cell membrane | Membrane,Signal | Transmembrane | Transmembrane helix,Belongs to the protein kinase superfamily. Tyr...,,...,794,84804,0,5.804307,-12.713834,-0.113350,15,Saimiri,Eukaryota,"['MGRWGLLLGW', 'FGAAGAILCS', 'CSQEPFLQSS', 'PR..."
2,A0A836ABD7_SHEEP,"Solute carrier family 2, facilitated glucose t...","Solute carrier family 2, facilitated glucose t...",,,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Cell membrane | Membrane,Transmembrane | Transmembrane helix,Belongs to the major facilitator superfamily. ...,,...,1364,149825,0,8.640005,11.644155,-0.051760,15,Ovis,Eukaryota,"['MELEPGGAAA', 'ALLRQKRAAL', 'RRRGCSFESP', 'ST..."
3,A0A6P3RGQ7_PTEVA,Phosphatidylinositol 5-phosphate 4-kinase type...,Phosphatidylinositol 5-phosphate 4-kinase type...,,KIF5A,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Cytoplasm | Endoplasmic reticulum,,,Lipid metabolism,...,422,47259,0,6.360874,-5.279498,-0.388389,15,Pteropus,Eukaryota,"['MASSSAPPAT', 'VPATTAAPGP', 'GFGFASKTKK', 'KH..."
4,A0A3B3DRK3_ORYME,Tyrosine-protein kinase receptor,Tyrosine-protein kinase receptor,,,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Cell membrane | Membrane,Immunoglobulin domain | Leucine-rich repeat | ...,Belongs to the protein kinase superfamily. Tyr...,Differentiation | Neurogenesis,...,810,92125,0,6.064459,-13.407168,-0.165185,17,Oryzias,Eukaryota,"['MDLWFHSIRI', 'CWWRVLFLMS', 'IFQDYLSSML', 'DC..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140266,H3D1B6_TETNG,ISL LIM homeobox 2b,,ISL LIM homeobox 2b,,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Nucleus,Homeobox | LIM domain | Repeat,,Transcription,...,362,40457,201,8.289168,3.402710,-0.450829,16,Tetraodon,Eukaryota,"['MVDIIFSSSF', 'LGDMGDHSKK', 'KQGFAMCVGC', 'GS..."
140267,A0A7L0QEG3_SETKR,R-spondin-3,R-spondin-3,,Rspo3,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Secreted,Repeat | Signal,Belongs to the R-spondin family,Sensory transduction | Wnt signaling pathway,...,243,27409,0,9.391770,23.097770,-1.103292,21,Setophaga,Eukaryota,"['VHPNVSQGCQ', 'GGCATCSDYN', 'GCLSCKPRLF', 'FV..."
140268,A0A850V5Z5_9CORV,Insulin,Insulin,,Ins,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Secreted,Signal,Belongs to the insulin family,Carbohydrate metabolism | Glucose metabolism,...,105,11794,0,5.526705,-3.758263,-0.029524,21,Chloropsis,Eukaryota,"['MALWIRSLPL', 'LALLALSSPG', 'SIQGAVNQHL', 'CG..."
140269,A0A8C2VZR5_CHILA,Cartilage intermediate layer protein,,Cartilage intermediate layer protein,CILP,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Extracellular matrix | Secreted,Immunoglobulin domain | Signal,,,...,1190,132470,0,8.589719,14.987649,-0.485042,14,Chinchilla,Eukaryota,"['MEGAEAWLFS', 'FLVLQVASVL', 'PGPSRRENRA', 'HP..."
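
The `First_Rank`, `Last_Rank`, and `Depth_Rank` columns look as if they were derived from the pipe-separated `Taxonomic lineage` string, but the notebook does not show how they were computed. The sketch below is one plausible reconstruction under that assumption; the toy lineages and the lowercase output column names are hypothetical.

```python
import pandas as pd

# Hypothetical lineages in the same "A | B | C" format as the dataset
toy = pd.DataFrame({
    "Taxonomic lineage": [
        "Eukaryota | Metazoa | Chordata",
        "Eukaryota | Metazoa | Ecdysozoa | Nematoda",
    ]
})

# Split on the pipe and strip the surrounding spaces from each rank
ranks = toy["Taxonomic lineage"].str.split("|").apply(
    lambda parts: [p.strip() for p in parts]
)

toy["first_rank"] = ranks.str[0]   # broadest rank in the lineage
toy["last_rank"] = ranks.str[-1]   # most specific rank in the lineage
toy["depth"] = ranks.str.len()     # number of ranks listed

print(toy)
```

Comparing these derived values against the existing columns on a few rows would confirm or refute the assumption before relying on it.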


In [8]:
df.drop(columns=["Recommended Name", "Submitted Names", "Sequence:"], inplace=True)
df

Unnamed: 0,ID,Name,Gene name,Taxonomic lineage,Cellular components:,Domain:,Family:,Biological process:,Function:,Length:,Molecular weight:,Number of interactors:,pI,Net_Charge_7_4,Hydrophobicity_GRAVY,Depth_Rank,Last_Rank,First_Rank,Sequence_Chunks
0,A0A6G0UGL8_9BILA,receptor protein-tyrosine kinase,,Eukaryota | Metazoa | Ecdysozoa | Nematoda | C...,Cell membrane | Membrane,Signal | Transmembrane | Transmembrane helix,Belongs to the protein kinase superfamily. Tyr...,Cell adhesion,Kinase | Receptor | Transferase | Tyrosine-pro...,838,95462,0,5.518918,-17.586081,-0.435322,11,Halicephalobus,Eukaryota,"['MTDRVLDRQC', 'NKALGMESGR', 'IKDSQISASS', 'SF..."
1,A0A2K6SKD5_SAIBB,Tyrosine-protein kinase receptor,LTK,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Cell membrane | Membrane,Signal | Transmembrane | Transmembrane helix,Belongs to the protein kinase superfamily. Tyr...,,Kinase | Receptor | Transferase | Tyrosine-pro...,794,84804,0,5.804307,-12.713834,-0.113350,15,Saimiri,Eukaryota,"['MGRWGLLLGW', 'FGAAGAILCS', 'CSQEPFLQSS', 'PR..."
2,A0A836ABD7_SHEEP,"Solute carrier family 2, facilitated glucose t...",,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Cell membrane | Membrane,Transmembrane | Transmembrane helix,Belongs to the major facilitator superfamily. ...,,GTPase activation,1364,149825,0,8.640005,11.644155,-0.051760,15,Ovis,Eukaryota,"['MELEPGGAAA', 'ALLRQKRAAL', 'RRRGCSFESP', 'ST..."
3,A0A6P3RGQ7_PTEVA,Phosphatidylinositol 5-phosphate 4-kinase type...,KIF5A,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Cytoplasm | Endoplasmic reticulum,,,Lipid metabolism,Kinase | Transferase,422,47259,0,6.360874,-5.279498,-0.388389,15,Pteropus,Eukaryota,"['MASSSAPPAT', 'VPATTAAPGP', 'GFGFASKTKK', 'KH..."
4,A0A3B3DRK3_ORYME,Tyrosine-protein kinase receptor,,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Cell membrane | Membrane,Immunoglobulin domain | Leucine-rich repeat | ...,Belongs to the protein kinase superfamily. Tyr...,Differentiation | Neurogenesis,Developmental protein | Kinase | Receptor | Tr...,810,92125,0,6.064459,-13.407168,-0.165185,17,Oryzias,Eukaryota,"['MDLWFHSIRI', 'CWWRVLFLMS', 'IFQDYLSSML', 'DC..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140266,H3D1B6_TETNG,ISL LIM homeobox 2b,,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Nucleus,Homeobox | LIM domain | Repeat,,Transcription,Developmental protein | DNA-binding,362,40457,201,8.289168,3.402710,-0.450829,16,Tetraodon,Eukaryota,"['MVDIIFSSSF', 'LGDMGDHSKK', 'KQGFAMCVGC', 'GS..."
140267,A0A7L0QEG3_SETKR,R-spondin-3,Rspo3,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Secreted,Repeat | Signal,Belongs to the R-spondin family,Sensory transduction | Wnt signaling pathway,Heparin-binding,243,27409,0,9.391770,23.097770,-1.103292,21,Setophaga,Eukaryota,"['VHPNVSQGCQ', 'GGCATCSDYN', 'GCLSCKPRLF', 'FV..."
140268,A0A850V5Z5_9CORV,Insulin,Ins,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Secreted,Signal,Belongs to the insulin family,Carbohydrate metabolism | Glucose metabolism,Hormone,105,11794,0,5.526705,-3.758263,-0.029524,21,Chloropsis,Eukaryota,"['MALWIRSLPL', 'LALLALSSPG', 'SIQGAVNQHL', 'CG..."
140269,A0A8C2VZR5_CHILA,Cartilage intermediate layer protein,CILP,Eukaryota | Metazoa | Chordata | Craniata | Ve...,Extracellular matrix | Secreted,Immunoglobulin domain | Signal,,,,1190,132470,0,8.589719,14.987649,-0.485042,14,Chinchilla,Eukaryota,"['MEGAEAWLFS', 'FLVLQVASVL', 'PGPSRRENRA', 'HP..."
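
The preview above shows empty cells in several annotation columns (`Gene name`, `Domain:`, `Family:`, and others). A small per-column summary makes the extent of missing data explicit. This is a sketch on a hypothetical toy frame; the column names are borrowed from the dataset, the values are not.

```python
import pandas as pd

# Hypothetical toy rows using a few of the dataset's column names
toy = pd.DataFrame({
    "Gene name": [None, "LTK", None, "KIF5A"],
    "Family:": ["Belongs to the insulin family", None, None, None],
    "Length:": [838, 794, 1364, 422],
})

# Count and percentage of missing values per column, worst first
missing = (
    toy.isna().sum().to_frame("n_missing")
    .assign(pct_missing=lambda d: 100 * d["n_missing"] / len(toy))
    .sort_values("n_missing", ascending=False)
)
print(missing)
```

Applied to `df`, the same expression (with `len(df)` in place of `len(toy)`) would rank the annotation columns by how sparse they are.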


In [9]:
# Pass the path directly so pandas manages the file handle and line endings.
df.to_csv('protein_features_cleaned_combine.csv', index=False)
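
Because the data comes from a CSV, `Sequence_Chunks` holds the text representation of a list rather than a list object, and it will again be plain text when the cleaned file is reloaded. The sketch below shows one way to turn it back into lists with `ast.literal_eval`. It round-trips a hypothetical two-row frame through an in-memory buffer instead of a file on disk.

```python
import ast
import io

import pandas as pd

# Hypothetical rows with a list-like column stored as text
toy = pd.DataFrame({
    "ID": ["P1", "P2"],
    "Sequence_Chunks": ["['MTDRVLDRQC', 'NKALGMESGR']", "['MALWIRSLPL']"],
})

# Write to and read from an in-memory buffer, standing in for the CSV file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Parse each string safely into a real Python list
reloaded["Sequence_Chunks"] = reloaded["Sequence_Chunks"].apply(ast.literal_eval)
n_chunks = reloaded["Sequence_Chunks"].apply(len).tolist()
print(n_chunks)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.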