**Original data source**


https://archive.ics.uci.edu/dataset/76/nursery

## Exploratory Analysis
To begin this exploratory analysis, first import the libraries and define functions for plotting the data using `matplotlib`. Depending on the data, not all plots will be made.

In [None]:
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd

In [None]:
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder 
from sklearn.cluster import KMeans 
from sklearn.metrics import silhouette_score 

There is 1 CSV file in the current version of the dataset:


In [None]:
print(os.listdir('../input'))

In [None]:
df = pd.read_csv("/kaggle/input/nursery_data.csv")


In [None]:
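# The original file has no header row, which makes the columns impossible
# to interpret, so column names are assigned here.
# Note: pd.read_csv without header=None uses the first record as the header,
# so that record is not counted among the data rows.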


df.columns = ["parents", "has_nurs", "form", "children", "housing", "finance", "social", "health", "class"]

### Class column values

| Class value | Meaning / closest interpretation |
| --- | --- |
| not_recom | Child is rejected: not recommended for admission |
| recommend | Child is accepted: regular acceptance |
| very_recom | Child is highly recommended / top priority: very strong case for admission |
| priority | Child is given priority: admission is prioritized but not topmost |
| spec_prior | Child has special priority: may include special needs, exceptional cases, very high priority |


In [None]:
df.shape

The dataset has 12,959 rows and 9 columns, so we can process it in full.
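
The row count depends on how a header-less CSV is read. The following is a minimal sketch, assuming the file has no header row as noted in the comment above. The two sample rows and the in-memory `io.StringIO` source are stand-ins for illustration, not the notebook's actual file.

```python
import io
import pandas as pd

# Two illustrative rows in a header-less, nine-column layout
# (stand-in text; the notebook reads the real file from disk).
raw = (
    "usual,proper,complete,1,convenient,convenient,nonprob,recommended,recommend\n"
    "usual,proper,complete,1,convenient,convenient,nonprob,priority,priority\n"
)
names = ["parents", "has_nurs", "form", "children", "housing",
         "finance", "social", "health", "class"]

# Default behaviour: the first record is consumed as the header.
df_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every record as data; names= supplies the column titles.
df_named = pd.read_csv(io.StringIO(raw), header=None, names=names)

print(df_default.shape)  # (1, 9)
print(df_named.shape)    # (2, 9)
```

If the real file behaves the same way, passing `header=None, names=...` to `pd.read_csv` would keep the first record as data.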

In [None]:
df.head()

In [None]:
# Check the column dtypes to see whether the columns need different preparation
df.dtypes

All columns have dtype object, but from the preview above, the "children" column seems to hold the number of children. Let's investigate further to see whether its dtype needs converting.

In [None]:
# The dtype check above shows the 'children' column as object;
# let's look at its unique values
print(df['children'].unique())

The values in the "children" column are 1, 2, 3, and "more", so the column is categorical rather than numerical.
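
As a small illustration of why such a column cannot simply be cast to a number, here is a sketch on a toy series. The values and the choice of an ordered categorical are assumptions for demonstration; the notebook itself one-hot encodes the column later.

```python
import pandas as pd

# Toy column with the same four levels as "children" (illustrative values only).
children = pd.Series(["1", "2", "3", "more", "2", "more"])

# pd.to_numeric cannot parse "more"; errors="coerce" turns it into NaN.
as_numbers = pd.to_numeric(children, errors="coerce")
print(as_numbers.isna().sum())  # 2

# An ordered categorical keeps all four levels and their ranking.
ordered = pd.Categorical(children, categories=["1", "2", "3", "more"], ordered=True)
print(ordered.codes.tolist())  # [0, 1, 2, 3, 1, 3]
```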

In [None]:
for col in df.columns:
    print(f"Column: {col}")
    print(df[col].unique())
    print("-" * 40)

The dataset is fictitious. Its purpose is to classify and select children for admission to daycare, given that the number of daycare places is limited and not all children can be accepted. The target column is therefore "class".

In [None]:
df.isnull().sum()

No missing values: the data is clean.


**My idea is that this dataset shows the landscape of families with children born in a given year: family size, economic situation, and health condition. We can plot the data to explore the socioeconomic picture of the region.**

**Originally, the dataset was made for ranking admission applications when the city had too many applications for daycare. Today, we do not need to hand-pick children to decide who qualifies for education. Classification and other analyses can instead help us understand the social factors and how to support children at daycare: what they need more of, and how schools can work with parents to help children develop fully and in many facets.**


In [None]:
for col in df.columns:
    plt.figure(figsize=(6,4))
    
    if df[col].dtype == 'object' or str(df[col].dtype) == 'category':
        # Create bar plot and capture the Axes object
        ax = df[col].value_counts().plot(kind='bar')
        
        plt.title(f"Value Counts of {col}")
        plt.xlabel(col)
        plt.ylabel("Count")

        # Add value labels on top of bars
        for patch in ax.patches:
            count = int(patch.get_height())
            plt.text(patch.get_x() + patch.get_width()/2, count, count,
                     ha='center', va='bottom')

        plt.show()

    else:
        df[col].hist(bins=10)
        plt.title(f"Histogram of {col}")
        plt.xlabel(col)
        plt.ylabel("Frequency")
        plt.show()

The plots above show nearly equal proportions for each category within every feature column, which says little about the demographics I was hoping to see. Let's try unsupervised K-means clustering to discover groupings that take all features into account.

First, we one-hot encode the categorical data and drop the target column "class".

In [None]:
X = pd.get_dummies(df)
X.head()

In [None]:
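# One-hot encode the features only (target column "class" dropped)
X = pd.get_dummies(df.drop('class', axis=1))
X_cluster = X.copy()
# Same features plus the target, used later for classification
X_logit = X_cluster.assign(class_=df['class'])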

In [None]:
X_logit.head()

In [None]:
# Select k: fit K-means for k = 1..10 and record the
# sum of squared errors (inertia) of the model for each k
sse_clust = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X_cluster)
    sse_clust.append(kmeans.inertia_)

In [None]:
plt.plot(range(1, 11), sse_clust)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

I decided to use k = 6 clusters.
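
The elbow can be hard to read by eye, so a complementary check is to compare silhouette scores across candidate values of k. The sketch below uses synthetic blobs from `make_blobs` as stand-in data (an assumption for illustration, not the nursery features), and `n_init=10` is an illustrative setting.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated blobs (not the nursery data).
X_toy, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                      cluster_std=0.5, random_state=42)

# Silhouette is only defined for k >= 2, so the range starts at 2.
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    scores[k] = silhouette_score(X_toy, km.fit_predict(X_toy))

# The k with the highest average silhouette is the best-separated choice.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On the notebook's data the same loop would run over `X_cluster`; a flat, low curve would suggest that no value of k separates the rows well.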

In [None]:
# Fit K-Means
kmeans = KMeans(n_clusters=6, random_state=42)
labels = kmeans.fit_predict(X_cluster)

# Evaluate clustering
score = silhouette_score(X_cluster, labels)
print("Silhouette Score:", score)

# Optional: Compare clusters with real classes
comparison = pd.crosstab(df['class'], labels)
print(comparison)
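
A crosstab like the one above can also be summarized in a single number. One option is the adjusted Rand index, which is 1.0 when two partitions match exactly (regardless of label names) and close to 0 or negative when they agree no better than chance. This is a sketch on hypothetical labels, not the notebook's output.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels for six samples (not taken from the notebook's output).
true_classes = ["priority", "priority", "not_recom",
                "not_recom", "spec_prior", "spec_prior"]
same_grouping = [2, 2, 0, 0, 1, 1]    # same partition, different label names
mixed_grouping = [0, 1, 0, 1, 0, 1]   # each cluster mixes all the classes

ari_same = adjusted_rand_score(true_classes, same_grouping)
ari_mixed = adjusted_rand_score(true_classes, mixed_grouping)

print(ari_same)   # 1.0
print(ari_mixed)  # negative: worse than chance agreement
```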

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_cluster)

plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=labels, cmap='tab10', s=10)
plt.title("K-Means Clusters (PCA projection)")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.colorbar(label="Cluster")
plt.show()

Given the low silhouette score and the lack of clear grouping in the PCA projection, K-means clustering is not suitable for this dataset.
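
When reading a 2D PCA scatter plot, it helps to know how much of the total variance the two components actually capture. The sketch below uses a random 0/1 matrix as a stand-in for one-hot features (an assumption for illustration only).

```python
import numpy as np
from sklearn.decomposition import PCA

# Random one-hot-like stand-in matrix (0/1 entries), not the nursery features.
rng = np.random.default_rng(42)
X_toy = rng.integers(0, 2, size=(200, 12)).astype(float)

pca = PCA(n_components=2)
pca.fit(X_toy)

# Fraction of total variance captured by each of the two components.
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

A small sum means the 2D picture hides most of the structure, so an unclear plot alone is weak evidence about the clusters.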

In [None]:
# Try classification with logistic regression and decision trees
# First, import the libraries
# Classification performance evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix

# Logistic regression
from sklearn.linear_model import LogisticRegression

# Decision trees
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import graphviz
from sklearn.ensemble import BaggingClassifier

# Random forest classifier
from sklearn.ensemble import RandomForestClassifier

# Grid search
from sklearn.model_selection import GridSearchCV

In [None]:
#test set and train set split
col_list = list(X_logit.columns)
col_list.remove('class_')
X = X_logit[col_list]
y = X_logit['class_']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
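
With imbalanced classes, a plain random split can leave a rare class under-represented in one of the two sets. A possible alternative is the `stratify` argument of `train_test_split`, sketched here on made-up labels. Note that stratification requires at least two samples in every class, so it may not work directly if a class has a single row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 samples of class "a", 20 of class "b" (made-up data).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array(["a"] * 80 + ["b"] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print((y_te == "b").sum(), len(y_te))  # 5 25
```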

In [None]:
X.head()

In [None]:
# Train and evaluate a logistic regression model
# max_iter is raised above the default so the solver can converge

data_logistic = LogisticRegression(max_iter = 500)
data_logistic.fit(X_train,y_train)

pred_logistic = data_logistic.predict(X_test)
print(confusion_matrix(y_test,pred_logistic))

# The results are not very good: one class is much smaller than the others

print(classification_report(y_test,pred_logistic))

Precision, recall, and therefore F1-score are high for `not_recom`, `priority`, and `spec_prior`, but the entire `very_recom` group was mislabeled as `priority`. As a result, a child who should get a nursery place right away would be put on the waiting list.
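
To make the link between the confusion matrix and these metrics concrete, here is a worked sketch with a hypothetical matrix. The counts are invented for illustration and only mimic the pattern described above, where one class is never predicted.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class).
# The counts are made up for illustration, not taken from the notebook's output.
labels = ["not_recom", "priority", "spec_prior", "very_recom"]
cm = np.array([
    [50,  0,  0, 0],
    [ 0, 40,  5, 0],
    [ 0,  4, 41, 0],
    [ 0, 10,  0, 0],   # every very_recom sample is predicted as priority
])

true_pos = np.diag(cm)
row_totals = cm.sum(axis=1)   # samples that truly belong to each class
col_totals = cm.sum(axis=0)   # samples predicted as each class

# Recall: share of each true class that was predicted correctly.
recall = true_pos / row_totals
# Precision is undefined when a class is never predicted; report 0.0 in that case.
precision = np.divide(true_pos, col_totals,
                      out=np.zeros(len(labels)), where=col_totals > 0)

for name, p, r in zip(labels, precision, recall):
    print(f"{name:<11} precision={p:.2f} recall={r:.2f}")
```

In this toy matrix, `very_recom` gets a recall of 0 because none of its ten samples land on the diagonal, while `priority` loses precision because it absorbs them.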

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Compute confusion matrix
cm = confusion_matrix(y_test, pred_logistic)

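# Derive class names from the labels present in y_test and the predictions,
# in the same sorted order that confusion_matrix uses by default
from sklearn.utils.multiclass import unique_labels
class_names = list(unique_labels(y_test, pred_logistic))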

# Create heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names,
            yticklabels=class_names)

plt.title("Confusion Matrix")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

# Print classification report
print(classification_report(y_test, pred_logistic, target_names=class_names))

## Conclusion!