
# Task
Load the data from "/content/drive/MyDrive/data/iris_bad.csv", preprocess it, split it into training and test sets, train a Decision Tree model, and evaluate its performance.

## Data Loading

### Subtask:
Load the data from the `/content/drive/MyDrive/data/iris_bad.csv` file.


**Reasoning**:
Import pandas and load the data from the specified CSV file into a DataFrame.



In [1]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/data/iris_bad.csv')
df.head()

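|   | no | sepalLength | sepalWidth | petalLength | petalWidth | species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |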
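
Beyond `head()`, a quick look at the shape, dtypes, and class counts right after loading can surface problems early. The sketch below illustrates this on a hypothetical in-memory stand-in for `iris_bad.csv`: the three rows, the blank `petalLength` cell, and the name `sample` are assumptions made for illustration, not the contents of the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for iris_bad.csv: same column names as the head() output,
# with one petalLength cell left blank to mimic a missing entry.
csv_text = """no,sepalLength,sepalWidth,petalLength,petalWidth,species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                      # number of rows and columns
print(sample.dtypes)                     # numeric columns should not be 'object'
print(sample["species"].value_counts())  # rows per class
```

A numeric column that shows up as `object` usually means the file contains stray text in that column, which is worth checking before any imputation.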


## Data Preprocessing

### Subtask:
Handle problems in the data such as missing values and outliers.


**Reasoning**:
Check for missing values in the DataFrame.



In [2]:
df.isnull().sum()

| Column | Missing values |
|---|---|
| no | 0 |
| sepalLength | 0 |
| sepalWidth | 0 |
| petalLength | 2 |
| petalWidth | 0 |
| species | 0 |
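
The subtask also mentions outliers, which a missing-value count does not reveal. One common check, sketched here as an assumption rather than a step this notebook performs, is the 1.5 × IQR rule. The helper name `iqr_outlier_mask` and the demo values are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)


# Hypothetical values: five plausible sepal widths plus one obvious entry error.
demo = pd.Series([3.5, 3.0, 3.2, 3.1, 3.6, 30.0], name="sepalWidth")
mask = iqr_outlier_mask(demo)

print(demo[mask])  # only the flagged values
```

Whether flagged rows should be dropped, capped, or corrected depends on how the errors were introduced, so the mask is best treated as a list of rows to inspect.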


**Reasoning**:
Fill the missing values in the 'petalLength' column with the mean of the column.



In [3]:
# Assign the result back instead of calling fillna(..., inplace=True) on df['petalLength'].
# The chained in-place form triggers a pandas warning: the intermediate object behaves
# as a copy, and the behavior will change in pandas 3.0.
df['petalLength'] = df['petalLength'].fillna(df['petalLength'].mean())
df.isnull().sum()


| Column | Missing values |
|---|---|
| no | 0 |
| sepalLength | 0 |
| sepalWidth | 0 |
| petalLength | 0 |
| petalWidth | 0 |
| species | 0 |
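
The mean used here is computed over every row, including rows that later end up in the test set. A stricter variant, sketched below under the assumption that imputation is moved after the split, fits scikit-learn's `SimpleImputer` on the training rows only and reuses that mean for the test rows. The frame `X_demo` and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical feature frame with two missing petalLength values.
X_demo = pd.DataFrame({
    "petalLength": [1.4, np.nan, 1.3, 4.7, 4.5, np.nan, 6.0, 5.1],
    "petalWidth": [0.2, 0.2, 0.2, 1.4, 1.5, 1.3, 2.5, 1.9],
})

X_tr, X_te = train_test_split(X_demo, test_size=0.25, random_state=42)

# Learn the column means from the training rows only.
imputer = SimpleImputer(strategy="mean")
X_tr_filled = pd.DataFrame(
    imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
# Apply the same training means to the test rows.
X_te_filled = pd.DataFrame(
    imputer.transform(X_te), columns=X_te.columns, index=X_te.index
)

print(imputer.statistics_)  # per-column means learned from the training rows
```

With only two missing values the difference is likely negligible for this dataset, but the pattern keeps test-set information out of preprocessing.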


**Reasoning**:
Remove the 'no' column as it is not needed for model training.



In [4]:
df.drop('no', axis=1, inplace=True)
df.head()

|   | sepalLength | sepalWidth | petalLength | petalWidth | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


## Data Splitting

### Subtask:
Split the data into a training set and a test set.


**Reasoning**:
Import the train_test_split function from sklearn.model_selection and separate the features and the target variable.



In [5]:
from sklearn.model_selection import train_test_split

X = df.drop('species', axis=1)
y = df['species']

**Reasoning**:
Split the features and target variable into training and testing sets using train_test_split.



In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
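
The split above is purely random, so class proportions in the test set can drift from those in the full data. Passing `stratify` keeps them aligned. The sketch below uses scikit-learn's bundled iris data as a stand-in, which is an assumption: it is not `iris_bad.csv`, and the variable names are illustrative.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in data: the standard 150-row iris set with 50 rows per class.
X_iris, y_iris = load_iris(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.25, random_state=42, stratify=y_iris
)

# Count how many test rows each class received.
test_counts = Counter(y_te.tolist())
print(test_counts)
```

In the notebook itself this would amount to adding `stratify=y` to the existing call.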

## Model Selection and Training

### Subtask:
Select a Decision Tree model and train it on the training data.


**Reasoning**:
Import the DecisionTreeClassifier and instantiate and train the model using the training data.



In [7]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
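
`DecisionTreeClassifier()` without a `random_state` can produce slightly different trees between runs, because features are permuted at each split and ties are broken randomly. A minimal sketch of a reproducible fit follows, again on scikit-learn's bundled iris data as an assumed stand-in. The `max_depth=2` value is an illustrative choice, not a tuned setting.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Two fits with the same seed should yield the same tree.
tree_a = DecisionTreeClassifier(random_state=42).fit(X_iris, y_iris)
tree_b = DecisionTreeClassifier(random_state=42).fit(X_iris, y_iris)

print(tree_a.get_depth(), tree_a.get_n_leaves())  # size of the unconstrained tree

# Capping the depth is one simple way to limit model complexity.
shallow = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_iris, y_iris)
print(shallow.get_depth(), shallow.get_n_leaves())
```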

## Model Evaluation

### Subtask:
Evaluate the model's performance on the test data.


**Reasoning**:
Import the necessary metrics from sklearn and calculate the accuracy, precision, recall, and F1 score of the model on the test data.



In [9]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')

Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000
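
Weighted averages compress per-class behavior into a single number. A confusion matrix and a per-class report show where any errors fall. The sketch below is self-contained and uses scikit-learn's bundled iris data as an assumed stand-in, so its numbers are not the notebook's results.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42
)

clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

labels = [0, 1, 2]  # fix the label order so the matrix is always 3x3
cm = confusion_matrix(y_te, y_hat, labels=labels)
print(cm)  # rows are true classes, columns are predicted classes
print(classification_report(y_te, y_hat, labels=labels, target_names=iris.target_names))
```

In the notebook, the same two calls could be applied directly to `y_test` and `y_pred`.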


## Summary:

### Data Analysis Key Findings

*   The dataset initially contained 2 missing values in the 'petalLength' column and an unnecessary 'no' column.
*   Missing values in 'petalLength' were imputed with the mean of the column.
*   The 'no' column was removed from the dataset.
*   The cleaned data was split into training and test sets with a test size of 25%.
*   A Decision Tree Classifier model was trained on the training data.
*   The trained Decision Tree model achieved perfect scores on the test set: Accuracy: 1.0000, Precision: 1.0000, Recall: 1.0000, and F1 Score: 1.0000.

### Insights or Next Steps

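*   Perfect scores on a single test split do not by themselves show overfitting, which would appear as a gap between training and test scores. They more likely reflect an easy classification task evaluated on a single 25% hold-out. Cross-validation, or evaluation on a harder dataset, would give a more reliable estimate.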
*   Consider visualizing the Decision Tree to understand the decision rules it learned.
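
Both next steps can be sketched in a few lines. The example assumes scikit-learn's bundled iris data as a stand-in for the cleaned DataFrame. The 5-fold setup and the seeds are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(random_state=42)

# Cross-validation: five stratified folds instead of a single hold-out split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, iris.data, iris.target, cv=cv, scoring="accuracy")
print(scores)
print(scores.mean(), scores.std())

# Visualization: fit on all rows and print the learned rules as indented text.
clf.fit(iris.data, iris.target)
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The spread of the fold scores gives a rough sense of how much a single split's result can vary. `sklearn.tree.plot_tree` is an alternative when a graphical rendering is preferred.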

