<a href="https://colab.research.google.com/github/RomyPatel/JioInternship/blob/main/JioInternship.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Loading the dataset

In [None]:
import pandas as pd
df= pd.read_csv('/content/data_cancer.csv')
df

Analysis

In [None]:
df.info()

Cleaning data

In [None]:
df= df.drop(['id'], axis=1)
df= df.dropna()
df

In [None]:
df['diagnosis']= df['diagnosis'].astype('category').cat.codes
df
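
For reference, `astype('category').cat.codes` numbers the categories in sorted order. The sketch below shows this on a toy column; the label values 'B' and 'M' are an assumption made for illustration, not something read from the dataset.

```python
import pandas as pd

# Toy stand-in for the diagnosis column; the labels 'B' and 'M' are assumed.
toy = pd.Series(['M', 'B', 'B', 'M', 'B'], name='diagnosis')

cat = toy.astype('category')
codes = cat.cat.codes

# Categories are kept in sorted order, so each code is an index into that list.
mapping = dict(enumerate(cat.cat.categories))
print(mapping)
print(codes.tolist())
```

Printing the mapping once makes it clear which class ends up as 1 before any metric is read off.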

Reduction of dataset

In [None]:
correlation_matrix= df.corr()
correlation_matrix

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.subplots(figsize=(20,10))
sns.heatmap(correlation_matrix,square=True)

Selected features
- radius_mean
- perimeter_mean
- area_mean
- compactness_mean
- concavity_mean
- concave points_mean
- radius_se
- perimeter_se
- area_se
- radius_worst
- perimeter_worst
- area_worst
- compactness_worst
- concavity_worst
- concave points_worst

In [None]:
df_final= df[['diagnosis', 'radius_mean', 'perimeter_mean', 'area_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'radius_se', 'perimeter_se', 'area_se', 'radius_worst', 'perimeter_worst', 'area_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst' ]]
df_final

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.subplots(figsize=(20,10))
sns.heatmap(df_final.corr(),square=True)

We find that

1) perimeter_mean, area_mean, perimeter_worst and area_worst are interdependent

2) compactness_mean, concavity_mean, concave points_mean, compactness_worst, concavity_worst, concave points_worst are interdependent

3) radius_se, perimeter_se and area_se are interdependent
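
Groups like these can also be read off the correlation matrix programmatically rather than by eye. The following is a minimal sketch on synthetic data; the helper name `correlated_pairs` and the 0.9 threshold are hypothetical choices for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.9):
    """Return (feature_a, feature_b, corr) for pairs with |corr| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return pairs

# Synthetic stand-in data: perimeter is a linear function of radius,
# while noise is drawn independently of both.
rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=200)
toy = pd.DataFrame({
    'radius': radius,
    'perimeter': 2 * np.pi * radius,
    'noise': rng.normal(size=200),
})

pairs = correlated_pairs(toy, threshold=0.9)
print(pairs)
```

On the real data the same helper could be called as `correlated_pairs(df_final.drop(columns='diagnosis'))`, with the threshold tuned to taste.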

Statistical analysis

In [None]:
plt.scatter(df['radius_mean'], df['perimeter_mean'], c='r')
plt.scatter(df['radius_mean'], df['area_mean'], c='b')
plt.scatter(df['radius_mean'], df['perimeter_worst'], c='g')
plt.scatter(df['radius_mean'], df['area_worst'], c='y')
plt.show()

So we keep radius_mean and drop perimeter_mean, area_mean, perimeter_worst and area_worst.

In [None]:
df_final= df_final.drop(['perimeter_mean', 'area_mean', 'perimeter_worst', 'area_worst'], axis= 1)
df_final

In [None]:
plt.scatter(df['concave points_worst'], df['compactness_mean'], c='r')
plt.scatter(df['concave points_worst'], df['concavity_mean'], c='b')
plt.scatter(df['concave points_worst'], df['concave points_mean'], c='g')
plt.scatter(df['concave points_worst'], df['compactness_worst'], c='y')
# black instead of white, so the points are visible on the default white background
plt.scatter(df['concave points_worst'], df['concavity_worst'], c='k')
plt.show()

The scatter is more random for these features, so we retain all of them.

In [None]:
plt.scatter(df['radius_se'], df['perimeter_se'], c='r')
plt.scatter(df['radius_se'], df['area_se'], c='b')
plt.show()

So we retain radius_se and drop perimeter_se and area_se.

In [None]:
df_final= df_final.drop(['perimeter_se', 'area_se'], axis= 1)
df_final

Feature and target selection

In [None]:
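# input (features) and output (target), taken from the reduced dataset df_final
# column 0 is diagnosis; the remaining columns are the retained features
x= df_final.iloc[:,1:].values
y= df_final.iloc[:,0].values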

Splitting data

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test= train_test_split(x,y,test_size= 0.25, random_state= 0)
print(x_train.shape)
print(x_test.shape)

In [None]:
import numpy as np
print(np.unique(y_train, return_counts= True))
print(np.unique(y_test, return_counts= True))

Since the classes are imbalanced, we will focus on precision and recall.
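
As a rough illustration of what these two metrics measure, here is a minimal sketch on hand-made labels (the label lists are invented for the example, and class 1 is assumed to be the positive class).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels for illustration only; 1 is treated as the positive class.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# scikit-learn expects the true labels first, then the predictions.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = precision_score(y_true, y_hat)  # tp / (tp + fp)
recall = recall_score(y_true, y_hat)        # tp / (tp + fn)

print(tn, fp, fn, tp)
print(precision, recall)
```

Precision answers "of the cases predicted positive, how many really are", while recall answers "of the real positives, how many were found".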

Normalization

In [None]:
from sklearn.preprocessing import MinMaxScaler
s= MinMaxScaler()
x_train= s.fit_transform(x_train)
x_test= s.transform(x_test)
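
The scaler is fit on the training split only and then reused on the test split. A small sketch with made-up numbers shows the consequence: test values outside the training range are mapped outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data for illustration.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10, max=30 from train
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step.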

Building model

In [None]:
from sklearn.linear_model import LogisticRegression
model= LogisticRegression()
model.fit(x_train,y_train)

Prediction

In [None]:
y_pred= model.predict(x_test)
y_pred

Evaluation

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# true labels first, then predictions: rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))

In [None]:
print(accuracy_score(y_test, y_pred)*100)

In [None]:
# true labels first; swapping the arguments would swap precision and recall
print(classification_report(y_test, y_pred))