**Original data source**


https://archive.ics.uci.edu/dataset/76/nursery

## Exploratory Analysis
To begin this exploratory analysis, first import the libraries and define functions for plotting the data with `matplotlib`. Depending on the data, not all plots will be made.

In [None]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


There is 1 CSV file in the current version of the dataset:


In [None]:
print(os.listdir('../input'))

In [None]:
df = pd.read_csv("/kaggle/input/nursery_data.csv")


In [None]:
# The original file has no title for each column, which makes it impossible to interpret, so the column names are added here


df.columns = ["parents", "has_nurs", "form", "children", "housing", "finance", "social", "health", "class"]
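
One thing to watch when a file has no header line: `pd.read_csv` treats the first record as the header by default, so renaming the columns afterwards silently drops one row. The sketch below is only an illustration under that assumption. It uses two made-up rows in an in-memory string (not the real file) and compares the default read with `header=None` plus `names=`.

```python
import io
import pandas as pd

COLUMN_NAMES = ["parents", "has_nurs", "form", "children", "housing",
                "finance", "social", "health", "class"]

# Two made-up rows shaped like the file, with no header line (illustration only).
sample_csv = (
    "usual,proper,complete,1,convenient,convenient,nonprob,recommended,recommend\n"
    "usual,proper,complete,1,convenient,convenient,nonprob,priority,priority\n"
)

# Default behaviour: the first record is consumed as the header row.
lossy = pd.read_csv(io.StringIO(sample_csv))
lossy.columns = COLUMN_NAMES

# header=None keeps every record; names= assigns the labels in the same call.
full = pd.read_csv(io.StringIO(sample_csv), header=None, names=COLUMN_NAMES)

print(len(lossy), len(full))
```

If the real file does contain a header line, `header=None` would instead turn that line into a data row, so it is worth looking at the first raw line of the file before choosing.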

In [None]:
df.shape

In [None]:
df.head()

In [None]:
# Check the column types; columns of different types will need to be prepared differently
df.dtypes

In [None]:
# From the dtype check above, the 'children' (number of children) column has dtype object,
# so let's look at its unique values
print(df['children'].unique())
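
A count column usually ends up as `object` when at least one value is not a number. Here is a minimal sketch of one way to keep the natural order in that case, assuming the values are `'1'`, `'2'`, `'3'` and `'more'` (that set is an assumption; check it against the output of the cell above). The series below is made up for illustration.

```python
import pandas as pd

# Illustrative values only; assumes the column holds '1', '2', '3' and 'more'.
children = pd.Series(["1", "more", "3", "2", "more"], name="children")

# Declare the order explicitly so that 'more' sorts after '3'.
child_order = ["1", "2", "3", "more"]
children_cat = pd.Categorical(children, categories=child_order, ordered=True)

# Integer codes follow the declared order: '1' -> 0, '2' -> 1, '3' -> 2, 'more' -> 3.
codes = pd.Series(children_cat.codes, name="children_code")
print(codes.tolist())
```

An ordered categorical keeps the labels readable in plots while still giving integer codes for models that need numbers.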

In [None]:
for col in df.columns:
    print(f"Column: {col}")
    print(df[col].unique())
    print("-" * 40)

Great, none of the columns contain NA values.

**My idea is that this dataset shows the landscape of families with children born in that year: family size, economic situation, and health condition. We can plot the data simply to investigate the socio-economic picture of the region.**

**Originally, the dataset was made to rank applications for admission, at a time when the city had too many applications for daycare. Today we do not need to hand-pick children to see who qualifies for education. But classification and other analyses can still be used to understand the social factors and how to educate these children at daycare: what they need more of, and how schools can join hands with parents to help children develop fully and in many facets.**


In [None]:
df.isnull().sum()

Perfect, the data has no missing values.
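
`isnull()` only counts real `NaN`/`None` values. Missing entries written as a text token such as `'?'` would pass this check unnoticed. The sketch below shows a quick extra check on a made-up frame; the placeholder tokens are assumptions, not values known to occur in this dataset.

```python
import pandas as pd

# Hypothetical frame with one real missing value and one '?' placeholder.
demo = pd.DataFrame({
    "housing": ["convenient", "?", "critical"],
    "finance": ["inconv", None, "convenient"],
})

# Assumed placeholder tokens; adjust to whatever the unique-value listing shows.
PLACEHOLDERS = ["?", "-", "unknown"]

nan_counts = demo.isnull().sum()                     # real NaN/None per column
placeholder_counts = demo.isin(PLACEHOLDERS).sum()   # text placeholders per column

print(nan_counts.to_dict())
print(placeholder_counts.to_dict())
```

Since the unique values of every column were printed earlier, the same conclusion can be reached by reading that output; this just makes the check explicit.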


In [None]:
for col in df.columns:
    plt.figure(figsize=(6,4))
    if df[col].dtype == 'object' or str(df[col].dtype) == 'category':
        counts = df[col].value_counts()
        ax = counts.plot(kind='bar')
        plt.title(f"Value Counts of {col}")
        plt.xlabel(col)
        plt.ylabel("Count")
        # Write the count above each bar (bars are drawn in the same order as counts)
        for count, patch in zip(counts, ax.patches):
            plt.text(patch.get_x() + patch.get_width() / 2, count, int(count),
                     ha='center', va='bottom')
        plt.show()
    else:
        df[col].hist(bins=10)
        plt.title(f"Histogram of {col}")
        plt.xlabel(col)
        plt.ylabel("Frequency")
        plt.show()

In [None]:
# Distribution graphs (histogram/bar graph) of column data
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]] # For displaying purposes, pick columns that have between 1 and 50 unique values
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow # integer division: plt.subplot needs an int row count
    plt.figure(num = None, figsize = (6 * nGraphPerRow, 8 * nGraphRow), dpi = 80, facecolor = 'w', edgecolor = 'k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if (not np.issubdtype(type(columnDf.iloc[0]), np.number)):
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation = 90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
    plt.show()


In [None]:
# Correlation matrix
def plotCorrelationMatrix(df, graphWidth):
    filename = df.dataframeName
    df = df.select_dtypes(include=[np.number]) # correlation is only defined for numeric columns
    df = df.dropna(axis='columns') # drop columns with NaN
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of numeric, non-NaN, non-constant columns ({df.shape[1]}) is less than 2')
        return
    corr = df.corr()
    plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
    corrMat = plt.matshow(corr, fignum = 1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title(f'Correlation Matrix for {filename}', fontsize=15)
    plt.show()
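
A correlation matrix only covers numeric columns, so it has little to show if the columns here are all categorical. As a possible alternative, a contingency table can show how two categorical columns relate. This is a minimal sketch on made-up rows; only the column names `finance` and `class` come from the notebook, and the values are illustrative.

```python
import pandas as pd

# Made-up rows for illustration; the column names match the ones assigned earlier.
demo = pd.DataFrame({
    "finance": ["convenient", "inconv", "convenient", "inconv"],
    "class": ["priority", "not_recom", "priority", "spec_prior"],
})

# Row-normalised contingency table: share of each class within a finance level.
table = pd.crosstab(demo["finance"], demo["class"], normalize="index")
print(table)
```

With `normalize="index"` each row sums to 1, which makes groups of different sizes easier to compare.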


In [None]:
# Scatter and density plots
def plotScatterMatrix(df, plotSize, textSize):
    df = df.select_dtypes(include =[np.number]) # keep only numerical columns
    # Remove rows and columns that would lead to df being singular
    df = df.dropna(axis='columns')
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    columnNames = list(df)
    if len(columnNames) > 10: # reduce the number of columns for matrix inversion of kernel density plots
        columnNames = columnNames[:10]
    df = df[columnNames]
    ax = pd.plotting.scatter_matrix(df, alpha=0.75, figsize=[plotSize, plotSize], diagonal='kde')
    corrs = df.corr().values
    for i, j in zip(*np.triu_indices_from(ax, k = 1)):
        ax[i, j].annotate('Corr. coef = %.3f' % corrs[i, j], (0.8, 0.2), xycoords='axes fraction', ha='center', va='center', size=textSize)
    plt.suptitle('Scatter and Density Plot')
    plt.show()


Now you're ready to read in the data and use the plotting functions to visualize the data.

### Let's check 1st file: ../input/nursery_data.csv

In [None]:
nRowsRead = 1000 # set to None to read the whole file
# nursery_data.csv actually has 12959 rows, but we only load/preview the first 1000 rows
df1 = pd.read_csv('../input/nursery_data.csv', delimiter=',', nrows = nRowsRead)
df1.dataframeName = 'nursery_data.csv'
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns')

Let's take a quick look at what the data looks like:

In [None]:
df1.head(5)

Distribution graphs (histogram/bar graph) of sampled columns:

In [None]:
plotPerColumnDistribution(df1, 10, 5)

## Conclusion
This concludes the starter analysis.