
# Creating numerical data with NumPy

NumPy, short for "Numerical Python," is a popular open-source library in the Python programming language. It is primarily used for numerical computing and provides support for arrays, matrices, and a large collection of mathematical functions. Key features of NumPy include:

1. N-dimensional Array Object (ndarray): NumPy's primary data structure is the ndarray, a powerful n-dimensional array that allows for efficient storage and manipulation of large datasets.

2. Mathematical Functions: NumPy offers a wide range of mathematical operations and functions, such as element-wise operations, statistical functions, linear algebra operations, Fourier transforms, and more.

3. Broadcasting: NumPy supports broadcasting, a mechanism that allows arrays of different shapes to be used together in arithmetic operations, simplifying code and improving performance.

4. Integration with Other Libraries: NumPy is often used in conjunction with other scientific computing libraries, such as SciPy (for more advanced scientific computations), pandas (for data manipulation), and Matplotlib (for plotting).

5. Performance: NumPy's core is implemented in C, and its linear algebra routines can call optimized BLAS/LAPACK libraries, so vectorized array operations are typically much faster than equivalent pure-Python loops, especially on large datasets.
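
The array, broadcasting, and vectorized-function features listed above can be illustrated with a minimal sketch. The array values and the names `matrix`, `offsets`, `shifted`, and `column_means` are made up for illustration and are not part of the notebook's dataset.

```python
import numpy as np

# A 2-D ndarray with shape (2, 3)
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# A 1-D array with shape (3,); broadcasting applies it to every row
offsets = np.array([10.0, 20.0, 30.0])
shifted = matrix + offsets

# Statistical functions operate on whole arrays without an explicit loop
column_means = matrix.mean(axis=0)

print(shifted)
print(column_means)
```

Here the shape `(3,)` array is combined with the shape `(2, 3)` array without any manual tiling, which is the behaviour that broadcasting refers to.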

This notebook creates a sample dataset with three types of data (numerical, categorical, and datetime) using NumPy.

# Difference between np.linspace and np.random


**numpy.linspace (np.linspace)**

Purpose: To generate evenly spaced numbers over a specified range.

Usage: numpy.linspace(start, stop, num)

Parameters:

start: The starting value of the sequence.

stop: The end value of the sequence.

num: The number of samples to generate (default 50). The start value is always included; the stop value is included unless endpoint=False is passed.
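
As a small sketch of the parameters described above (the range 0 to 1 and the count 5 are arbitrary choices for illustration):

```python
import numpy as np

# Five evenly spaced values from 0 to 1, with both ends included
points = np.linspace(0.0, 1.0, 5)
print(points)       # values: 0, 0.25, 0.5, 0.75, 1

# With endpoint=False the stop value is excluded, so the spacing shrinks
half_open = np.linspace(0.0, 1.0, 5, endpoint=False)
print(half_open)    # values: 0, 0.2, 0.4, 0.6, 0.8
```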

**numpy.random (np.random)**

Purpose: To generate random numbers.

Usage: numpy.random provides several functions to generate random numbers, including:

numpy.random.rand(d0, d1, ..., dn): Uniform distribution over [0, 1).

numpy.random.randn(d0, d1, ..., dn): Standard normal distribution (mean 0, variance 1).

numpy.random.randint(low, high=None, size=None, dtype=int): Random integers from low (inclusive) to high (exclusive).

numpy.random.normal(loc=0.0, scale=1.0, size=None): Normal (Gaussian) distribution with mean loc and standard deviation scale.

And many others.
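
The following sketch calls each of the functions listed above once. The fixed seed and the specific shapes and ranges are assumptions added for illustration, so that reruns produce the same draws.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed for reproducible draws

uniform = np.random.rand(3, 2)                 # shape (3, 2), values in [0, 1)
std_normal = np.random.randn(4)                # 4 draws from N(0, 1)
integers = np.random.randint(1, 7, size=10)    # integers from 1 to 6 inclusive
gaussian = np.random.normal(loc=250, scale=10, size=5)  # mean 250, std 10

print(uniform.shape, std_normal.shape, integers, gaussian.shape)
```

Unlike `np.linspace`, rerunning these calls without a seed gives different values each time.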

Summary

numpy.linspace: Generates a specified number of evenly spaced values over a given range. Useful for deterministic sequences where the exact values and their spacing are important.

numpy.random: Generates random numbers from various distributions. Useful for creating random samples, simulations, or any situation where randomness is required.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

#create numerical data

col1 = np.random.normal(250, 10, 500)
col2 = np.random.normal(100, 20, 500)
col3 = np.random.normal(750, 5, 500)

#create categorical data

categorical_data = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
col4 = np.random.choice(categorical_data, 500)
col5 = np.random.choice(categorical_data, 500) #use np.random.choice for categorical data

#create date data

start_date = datetime(2022, 1, 1)
end_date = datetime(2024, 1, 1)

delta = end_date - start_date
date_list = [start_date + timedelta(days= x) for x in range(delta.days)]

col6 = np.random.choice(date_list, 500)

# combine all the columns (features) into one 2-D array;
# because the columns have mixed types, the array gets a single object dtype

data = np.column_stack((col1, col2, col3, col4, col5, col6))


# build the DataFrame and name the columns
df = pd.DataFrame(data, columns=['Num1', 'Num2', 'Num3', 'Cat1', 'Cat2', 'Date'])
df.head()


| | Num1 | Num2 | Num3 | Cat1 | Cat2 | Date |
|---|---|---|---|---|---|---|
| 0 | 255.267017 | 105.080791 | 749.724206 | D | H | 2023-05-25 |
| 1 | 264.968164 | 76.039518 | 750.148374 | J | G | 2022-06-04 |
| 2 | 259.428517 | 90.54574 | 753.014475 | J | A | 2023-03-02 |
| 3 | 251.024686 | 86.48283 | 753.288838 | H | H | 2022-05-05 |
| 4 | 246.524448 | 92.093055 | 741.949379 | H | H | 2023-12-16 |


col1 = np.random.normal(250, 10, 500)

What it means:

np.random.normal: Draws random numbers from a normal (Gaussian) distribution.

250: The mean (loc), the value around which the numbers cluster.

10: The standard deviation (scale), which controls how far the numbers typically spread from the mean.

500: The number of values to generate (size).
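
A quick way to see what these three arguments do is to draw a sample and inspect it. This is a minimal sketch: the seed value and the names `sample` and `within_one_sd` are assumptions added for illustration.

```python
import numpy as np

np.random.seed(42)  # assumption: fixed seed so the check is repeatable

sample = np.random.normal(250, 10, 500)

print(sample.shape)    # 500 values were generated
print(sample.mean())   # close to the requested mean of 250
print(sample.std())    # close to the requested standard deviation of 10

# For a normal distribution, roughly 68% of values fall within one
# standard deviation of the mean
within_one_sd = np.mean(np.abs(sample - 250) < 10)
print(within_one_sd)
```

The sample mean and standard deviation will not equal 250 and 10 exactly, because they are estimates from a finite random sample.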

In [2]:
df.shape

(500, 6)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Num1    500 non-null    object        
 1   Num2    500 non-null    object        
 2   Num3    500 non-null    object        
 3   Cat1    500 non-null    object        
 4   Cat2    500 non-null    object        
 5   Date    500 non-null    datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 23.6+ KB


In [5]:

df.describe()  # summary statistics; only Date appears because Num1, Num2 and Num3 are still stored as object

| | Date |
|---|---|
| count | 500 |
| mean | 2023-01-07 20:55:40.800000 |
| min | 2022-01-02 00:00:00 |
| 25% | 2022-07-13 12:00:00 |
| 50% | 2023-01-12 12:00:00 |
| 75% | 2023-07-04 12:00:00 |
| max | 2023-12-29 00:00:00 |


In [6]:
df.describe(include = 'all')

| | Num1 | Num2 | Num3 | Cat1 | Cat2 | Date |
|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500 | 500 | 500 |
| unique | 500.0 | 500.0 | 500.0 | 10 | 10 | |
| top | 255.267017 | 105.080791 | 749.724206 | I | A | |
| freq | 1.0 | 1.0 | 1.0 | 59 | 59 | |
| mean | | | | | | 2023-01-07 20:55:40.800000 |
| min | | | | | | 2022-01-02 00:00:00 |
| 25% | | | | | | 2022-07-13 12:00:00 |
| 50% | | | | | | 2023-01-12 12:00:00 |
| 75% | | | | | | 2023-07-04 12:00:00 |
| max | | | | | | 2023-12-29 00:00:00 |


# Change data type

In [7]:

# Notice that 'Num1', 'Num2' and 'Num3' are stored as "object", not float (refer to df.info()).
# Note: this is not a mistake by Python. np.column_stack merged mixed-type columns into one
# object array, so pandas kept the numeric columns as object. We fix it with astype.

# change the data type

df['Num1'] = df['Num1'].astype('float')  # change one column at a time
df[['Num2', 'Num3']] = df[['Num2', 'Num3']].astype('float')  # change multiple columns at once

df.dtypes

Num1           float64
Num2           float64
Num3           float64
Cat1            object
Cat2            object
Date    datetime64[ns]
dtype: object
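
The need for `astype` comes from stacking mixed-type columns into a single array. As a hedged alternative, not used in this notebook, the sketch below builds a DataFrame from a dict so that each column keeps its own dtype. The small sample size and the names `nums`, `cats`, `dates`, `stacked`, and `df_typed` are assumptions for illustration.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed for repeatable draws
n = 5

nums = np.random.normal(250, 10, n)
cats = np.random.choice(['A', 'B', 'C'], n)
date_list = [datetime(2022, 1, 1) + timedelta(days=x) for x in range(30)]
dates = np.random.choice(date_list, n)

# Stacking floats, strings and datetime objects forces one shared dtype
stacked = np.column_stack((nums, cats, dates))
print(stacked.dtype)  # object

# Passing a dict lets pandas store each column with its own dtype
df_typed = pd.DataFrame({
    'Num1': nums,
    'Cat1': cats,
    'Date': pd.to_datetime(dates),
})
print(df_typed.dtypes)
```

With this construction `Num1` is already `float64`, so no conversion step is required afterwards.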

In [8]:

df.describe()  # repeat the step to check the effect of the data type change

| | Num1 | Num2 | Num3 | Date |
|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500 |
| mean | 250.348055 | 98.266409 | 749.652349 | 2023-01-07 20:55:40.800000 |
| min | 217.61007 | 42.936864 | 736.946561 | 2022-01-02 00:00:00 |
| 25% | 243.690194 | 84.531964 | 746.078027 | 2022-07-13 12:00:00 |
| 50% | 249.745402 | 98.064878 | 749.653589 | 2023-01-12 12:00:00 |
| 75% | 257.320106 | 112.954596 | 752.967087 | 2023-07-04 12:00:00 |
| max | 275.19721 | 147.548326 | 764.748811 | 2023-12-29 00:00:00 |
| std | 9.680468 | 19.509242 | 5.024358 | |


In [9]:
df.describe(include='all')

| | Num1 | Num2 | Num3 | Cat1 | Cat2 | Date |
|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500 | 500 | 500 |
| unique | | | | 10 | 10 | |
| top | | | | I | A | |
| freq | | | | 59 | 59 | |
| mean | 250.348055 | 98.266409 | 749.652349 | | | 2023-01-07 20:55:40.800000 |
| min | 217.61007 | 42.936864 | 736.946561 | | | 2022-01-02 00:00:00 |
| 25% | 243.690194 | 84.531964 | 746.078027 | | | 2022-07-13 12:00:00 |
| 50% | 249.745402 | 98.064878 | 749.653589 | | | 2023-01-12 12:00:00 |
| 75% | 257.320106 | 112.954596 | 752.967087 | | | 2023-07-04 12:00:00 |
| max | 275.19721 | 147.548326 | 764.748811 | | | 2023-12-29 00:00:00 |
