In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('gene_expression.csv')
data

Unnamed: 0,Gene One,Gene Two,Cancer Present
0,4.3,3.9,1
1,2.5,6.3,0
2,5.7,3.9,1
3,6.1,6.2,0
4,7.4,3.4,1
...,...,...,...
2995,5.0,6.5,1
2996,3.4,6.6,0
2997,2.7,6.5,0
2998,3.3,5.6,0


In [3]:
data.isnull().sum()

Gene One          0
Gene Two          0
Cancer Present    0
dtype: int64
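
Besides missing values, it can help to look at the class balance and the feature ranges before modeling. The following is a minimal sketch on a hand-made stand-in frame (`sample`, built from the five rows visible in the output above, not the full file), so the numbers it prints do not describe the real dataset:

```python
import pandas as pd

# Stand-in for the real file: only the first five rows shown above.
sample = pd.DataFrame({
    "Gene One": [4.3, 2.5, 5.7, 6.1, 7.4],
    "Gene Two": [3.9, 6.3, 3.9, 6.2, 3.4],
    "Cancer Present": [1, 0, 1, 0, 1],
})

# Share of each class; a heavily skewed split would make accuracy misleading.
class_share = sample["Cancer Present"].value_counts(normalize=True)
print(class_share)

# Min/max per feature, to see whether both genes are on comparable scales.
feature_ranges = sample[["Gene One", "Gene Two"]].agg(["min", "max"])
print(feature_ranges)
```

On the real data, the same two calls would run on `data` instead of `sample`.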

🧬 What is Gene Expression Level?
The expression level of a gene refers to how much that gene is being "used" or "activated" in a particular cell or tissue.

Think of a gene like a recipe in a cookbook 📖. The expression level tells you how often that recipe is being cooked.

🧠 Scientific Definition:
Gene expression level is the amount of RNA (or protein) produced from a gene.
It reflects how actively a gene is transcribed (read and used) in a specific condition.

🎯 Why It Matters:
Genes can be:

Highly expressed → producing a lot of RNA/protein (very active)

Lowly expressed → barely active

Not expressed → silent (turned off)

For example:

A cancer-promoting gene (an oncogene) might be highly expressed in tumor cells.

A liver enzyme gene may be active only in liver tissue.

📏 How Is It Measured?
Gene expression is measured using:

Technique	Unit	Meaning
Microarrays	Fluorescence intensity	Brighter = more expression
RNA-seq (NGS)	TPM, RPKM, FPKM	Read count per gene, normalized for sequencing depth and gene length
qPCR	Ct value (inverse scale)	Lower Ct = higher expression

RNA-seq is the most common technique in modern bioinformatics and data science work.
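
To make the "normalized" units concrete, here is a small worked sketch of how TPM (transcripts per million) is derived from raw read counts. The gene names, counts, and lengths are made up for illustration and have nothing to do with gene_expression.csv:

```python
# Hypothetical raw read counts and gene lengths (in kilobases) for three genes.
counts = {"GeneA": 500, "GeneB": 1000, "GeneC": 250}
lengths_kb = {"GeneA": 2.0, "GeneB": 4.0, "GeneC": 0.5}

# Step 1: reads per kilobase (RPK) corrects for gene length,
# since longer genes collect more reads at the same expression level.
rpk = {gene: counts[gene] / lengths_kb[gene] for gene in counts}

# Step 2: rescale so that all values in the sample sum to one million.
scale = sum(rpk.values()) / 1_000_000
tpm = {gene: rpk[gene] / scale for gene in rpk}

for gene, value in tpm.items():
    print(f"{gene}: {value:,.0f} TPM")
```

GeneB has twice the raw reads of GeneA but is also twice as long, so both end up with the same TPM, while the short GeneC comes out highest despite having the fewest reads.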

📈 Example in Machine Learning:
In a gene expression dataset, each row is a sample and each column is a gene:
Gene1   Gene2   Gene3   Diagnosis
12.4    0.5     8.1     Cancer
3.0     2.4     0.1     Healthy
In the first sample:

Gene1 = 12.4 → highly expressed

Gene2 = 0.5 → low expression

✅ Summary:
Gene expression level = how "on" or "active" a gene is

Measured by how much RNA/protein is produced

Used in biology, disease diagnosis, and machine learning on genomics data





In [4]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [6]:
X = data.drop(columns=['Cancer Present'])
Y = data['Cancer Present']

In [7]:
X

Unnamed: 0,Gene One,Gene Two
0,4.3,3.9
1,2.5,6.3
2,5.7,3.9
3,6.1,6.2
4,7.4,3.4
...,...,...
2995,5.0,6.5
2996,3.4,6.6
2997,2.7,6.5
2998,3.3,5.6


In [8]:
Y  # Binary label (1 = cancer, 0 = no cancer)

0       1
1       0
2       1
3       0
4       1
       ..
2995    1
2996    0
2997    0
2998    0
2999    0
Name: Cancer Present, Length: 3000, dtype: int64

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
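
An optional refinement that this notebook does not use: passing `stratify` to `train_test_split` keeps the class proportions the same in the training and test sets. The sketch below uses synthetic data (`X_demo`, `y_demo` are made-up names and values), so it only illustrates the mechanism:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 2 features, 20% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = np.array([1] * 40 + [0] * 160)

# stratify=y_demo preserves the 20/80 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print("train size:", len(X_tr), "positive share:", y_tr.mean())
print("test size:", len(X_te), "positive share:", y_te.mean())
```

This matters most when one class is rare; with roughly balanced labels a plain random split is usually close enough.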

In [10]:
# Create a KNN classifier instance
# n_neighbors is the 'k' in KNN: the number of nearest neighbors that vote on each prediction
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model on the training data
knn.fit(X_train, y_train)

# Make predictions on the test data
y_pred = knn.predict(X_test)
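
To show what the classifier does under the hood, here is a simplified from-scratch sketch of a KNN prediction: measure the Euclidean distance to every training point, take the k closest, and let them vote. `knn_predict` is a hypothetical helper written for this illustration, the toy points are made up, and scikit-learn's own implementation differs in details such as tie handling and neighbor search:

```python
from collections import Counter

import numpy as np


def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(train_points - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those neighbors.
    votes = Counter(train_labels[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy training set (made-up values, not taken from gene_expression.csv).
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(points, labels, np.array([1.1, 1.0]), k=3)
pred_high = knn_predict(points, labels, np.array([5.1, 5.0]), k=3)
print(pred_low, pred_high)  # → 0 1
```

Because the prediction depends directly on distances, features on very different scales can dominate the vote; the two genes here appear to share a similar range, so no scaling is applied in this notebook.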

In [11]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the KNN model: {accuracy:.2f}")

Accuracy of the KNN model: 0.93
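
Accuracy alone does not say which kind of error the model makes, and in a cancer setting a missed case (false negative) is usually worse than a false alarm. A confusion matrix separates the two. The sketch below uses six made-up labels (`y_true_demo`, `y_pred_demo`) rather than the notebook's real predictions:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only (1 = cancer, 0 = no cancer).
y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1]

# For binary labels, ravel() returns counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # → TN=2 FP=1 FN=1 TP=2

# Recall: share of actual cancer cases that were caught.
recall = recall_score(y_true_demo, y_pred_demo)
# Precision: share of predicted cancer cases that were correct.
precision = precision_score(y_true_demo, y_pred_demo)
print(f"Recall: {recall:.2f}, Precision: {precision:.2f}")
```

On the real model, the same calls would take `y_test` and `y_pred`.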


In [12]:
# Predict a new data point
# Wrap the values in a DataFrame with the training column names (Gene One, Gene Two)
# so scikit-learn does not warn about missing feature names
new_data_point = pd.DataFrame([[5.1, 3.5]], columns=X.columns)
predicted_class = knn.predict(new_data_point)
print("Predicted class:", predicted_class[0])

Predicted class: 1
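
`predict` returns only the winning label. If the strength of the vote matters, `predict_proba` reports the share of the k neighbors in each class. A minimal sketch on a toy model (`toy_X`, `toy_y`, and `toy_knn` are made up for this example and are not the notebook's fitted `knn`):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: four points per class, in two well-separated clusters.
toy_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [1.1, 1.3],
                  [5.0, 5.0], [5.2, 4.8], [4.9, 5.1], [5.1, 5.3]])
toy_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

toy_knn = KNeighborsClassifier(n_neighbors=5).fit(toy_X, toy_y)

# With k=5, the query's neighbors are the four class-0 points plus one class-1 point.
proba = toy_knn.predict_proba([[1.0, 1.0]])
print(toy_knn.classes_)  # column order of the probabilities
print(proba)             # → [[0.8 0.2]]
```

With the default uniform weights these values are simple neighbor fractions, so with k=5 they can only take the values 0, 0.2, 0.4, 0.6, 0.8, or 1.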




In [13]:
import joblib
joblib.dump(knn, 'knnforcancer.joblib')

['knnforcancer.joblib']
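
The saved file can later be restored with `joblib.load`. The sketch below round-trips a tiny stand-in model through a temporary directory, so it does not touch knnforcancer.joblib; the data and the file name `demo_knn.joblib` are made up for the example:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny stand-in model (made-up data, not the notebook's fitted knn).
X_small = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_small = np.array([0, 0, 0, 1, 1, 1])
model = KNeighborsClassifier(n_neighbors=3).fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_knn.joblib")
    joblib.dump(model, path)       # save to disk
    restored = joblib.load(path)   # load it back

# The restored model should give the same predictions as the original.
same_predictions = bool((restored.predict(X_small) == model.predict(X_small)).all())
print("Restored model matches original:", same_predictions)
```

Two practical notes: joblib files are pickle-based, so only load files from a trusted source, and it is safest to load a model with the same scikit-learn version that saved it.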