# Data Mining - Lab - 2

# NumPy & Data Exploration with Pandas

-------------------------------------------------------------------------------
## NumPy

1) NumPy (Numerical Python) is a powerful open-source library in Python used for numerical and scientific computing.<br>
2) It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on them efficiently.<br>
3) NumPy is highly optimized and written in C, making it much faster than using regular Python lists for numerical operations.<br>
4) It serves as the foundation for many other Python libraries in data science and machine learning, like pandas, TensorFlow, and scikit-learn.<br>
5) With features like broadcasting, vectorization, and integration with C/C++ code, NumPy allows for cleaner and faster code in numerical computations.<br>
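
As a minimal sketch of the vectorization and broadcasting mentioned in point 5, the example below uses made-up sample arrays (the names `prices`, `quantities`, and `grid` are illustrative and not part of the lab):

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration.
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])

# Vectorization: the multiplication is applied element-wise, with no Python loop.
totals = prices * quantities

# Broadcasting: the scalar 0.5 is stretched to match the shape of `prices`.
discounted = prices * 0.5

# Broadcasting a 1D row across every row of a 2D array.
grid = np.arange(6).reshape(2, 3)            # [[0, 1, 2], [3, 4, 5]]
shifted = grid + np.array([100, 200, 300])

print(totals)       # [10. 40. 90.]
print(discounted)   # [ 5. 10. 15.]
print(shifted)
```

The same results could be produced with explicit loops over Python lists, but the array expressions are shorter and run in compiled code.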



### Step 1. Import the NumPy library

In [2]:
import numpy as np



### Step 2. Create a 1D array of numbers

In [4]:
arr = np.array([1,2,3,4,5,6])
arr

array([1, 2, 3, 4, 5, 6])

In [15]:
arr = np.array([[1, 2, 3],[4, 5, 6]])
arr

array([[1, 2, 3],
       [4, 5, 6]])

In [18]:
# an array is created from a tuple
arr = np.array((1, 2, 3))
print(arr)

[1 2 3]


In [17]:
arr = np.arange(0,10)
print(arr)

[0 1 2 3 4 5 6 7 8 9]


In [10]:
ar = np.arange(1,10)
ar

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [9]:
arr = np.arange(4)
arr

array([0, 1, 2, 3])

### Step 3. Reshape 1D to 2D Array

In [8]:
arr2d = np.arange(12).reshape(4,3)
arr2d

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
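
A short sketch of two details of `reshape` that the cell above relies on: the new shape must hold exactly the same number of elements, and one dimension may be given as `-1` so that NumPy infers it. The variable names are illustrative.

```python
import numpy as np

flat = np.arange(12)

# -1 asks NumPy to infer that dimension from the total number of elements.
three_cols = flat.reshape(-1, 3)   # 12 elements / 3 columns -> 4 rows
two_rows = flat.reshape(2, -1)     # 12 elements / 2 rows -> 6 columns

print(three_cols.shape)   # (4, 3)
print(two_rows.shape)     # (2, 6)

# A shape whose element count is not 12 raises ValueError.
try:
    flat.reshape(5, 3)
except ValueError as exc:
    print("reshape failed:", exc)
```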

In [13]:
arr = np.array([1,2,3,4,5])
type(arr)
arr.dtype

dtype('int64')

In [41]:
arr = np.array([1,2,3,4,5])
type(arr)
arr.ndim

1

### Step 4. Create a Linspace array

In [3]:
arr = np.linspace(0, 1, 10)
print(arr)

[0.         0.11111111 0.22222222 0.33333333 0.44444444 0.55555556
 0.66666667 0.77777778 0.88888889 1.        ]
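
To make the spacing in the output above explicit, here is a small sketch comparing `np.linspace` (which takes a number of points and includes both endpoints by default) with `np.arange` (which takes a step and excludes the stop value). The `retstep=True` argument returns the computed spacing alongside the array.

```python
import numpy as np

# 10 points from 0 to 1 inclusive, so the spacing is (1 - 0) / (10 - 1).
points, step = np.linspace(0, 1, 10, retstep=True)
print(len(points), step)

# arange is defined by the step size instead, and stops before the stop value.
by_step = np.arange(0, 1, 0.25)
print(by_step)   # [0.   0.25 0.5  0.75]
```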


### Step 5. Create a Random Numbered Array

In [4]:
arr = np.random.randint(1, 101, size=10)
print(arr)

[53 81  8 43 89 88 12 87 23  3]


In [5]:
arr = np.random.rand(10)
print(arr)

[4.06031108e-01 4.27555352e-01 6.61591507e-01 3.88905569e-02
 8.27687062e-01 3.26030172e-01 9.87087271e-01 2.78305765e-01
 6.50626104e-01 6.29486528e-05]


In [6]:
arr = np.random.uniform(10, 50, size=10)
print(arr)

[35.74323571 43.89189736 10.57851089 42.02999017 42.5314231  23.67000884
 23.81574522 41.58695535 21.90963052 12.32477751]


### Step 6. Create a Random Integer Array

In [7]:
a1d = np.random.randint(1,50,size=10)
a1d

array([ 7, 49, 26, 17,  4, 43, 42, 46, 32,  5], dtype=int32)

In [None]:
# low is inclusive, high is exclusive, and size is how many numbers to generate
a2d = np.random.randint(1,20,size=30)
a2d

array([11,  2, 15,  1,  3, 19, 16,  7, 11, 10,  2, 13, 16,  7, 16,  4, 16,
        5,  6,  4,  6,  2, 15, 12, 14,  1, 16, 18, 14, 15], dtype=int32)
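
The random arrays above change on every run. If repeatable output is wanted, one option is a seeded generator from `np.random.default_rng`. This is a sketch of an alternative to the `np.random.randint` calls used in the lab, and the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

# integers() follows the same convention as randint: low inclusive, high exclusive.
first = rng_a.integers(1, 50, size=10)
second = rng_b.integers(1, 50, size=10)

print(first)
print(np.array_equal(first, second))   # True
```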

### Step 7. Create a 1D Array and get Max, Min, ArgMax, ArgMin

In [24]:
a1d = np.random.randint(1,50,size=10)
a1d

array([32, 17, 40,  3, 34, 29,  9, 15,  8, 27], dtype=int32)

In [None]:
print("Maximum is ",a1d.max())
print("Minimum is ",a1d.min())
print("Argumental max is ",a1d.argmax()) # gives the zero-based index of the maximum value
print("Argumental min is ",a1d.argmin())

Maximum is  40
Minimum is  3
Argumental max is  2
Argumental min is  3


In [34]:
a2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
# a2d = np.arange(1,13).reshape(3,4)
a2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [35]:
print("Maximum is ",a2d.max())
print("Minimum is ",a2d.min())
print("Argumental max is ",a2d.argmax())
print("Argumental min is ",a2d.argmin())

Maximum is  9
Minimum is  1
Argumental max is  8
Argumental min is  0
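
For a 2D array, `argmax()` and `argmin()` without an `axis` argument return a position in the flattened array, which is why the output above shows 8 rather than a (row, column) pair. A minimal sketch of converting that position back with `np.unravel_index`, and of asking for per-column or per-row positions instead:

```python
import numpy as np

a2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

flat_index = a2d.argmax()                            # position in the flattened array
row, col = np.unravel_index(flat_index, a2d.shape)   # back to (row, column)
print(flat_index, row, col)   # 8 2 2

# axis=0 searches down each column, axis=1 searches along each row.
print(a2d.argmax(axis=0))     # [2 2 2]
print(a2d.argmax(axis=1))     # [2 2 2]
```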


### Step 8. Indexing in 1D Array

In [36]:
a1d = np.random.randint(1,50,size=10)
a1d

array([42,  9, 45, 42, 34, 25,  5,  5, 32, 14], dtype=int32)

In [37]:
a1d[5]

np.int32(25)

### Step 9. Indexing in 2D Array

In [44]:
a2d = np.arange(1,13).reshape(3,4)
a2d


array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [40]:
a2d[2,3]

np.int64(12)

In [45]:
a2d = np.arange(1,13).reshape(3,4)
a2d[1][3]

np.int64(8)
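
The two cells above use `a2d[2,3]` and `a2d[1][3]`. Both styles reach a single element. The sketch below also shows row and column slicing on the same array (the result names are illustrative):

```python
import numpy as np

a2d = np.arange(1, 13).reshape(3, 4)

# a2d[1, 3] indexes in one step; a2d[1][3] first builds row 1, then indexes into it.
same = a2d[1, 3] == a2d[1][3]
print(same)   # True

second_row = a2d[1]         # whole row:    [5 6 7 8]
last_column = a2d[:, -1]    # whole column: [ 4  8 12]
top_left = a2d[:2, :2]      # 2x2 block:    [[1 2] [5 6]]

print(second_row)
print(last_column)
print(top_left)
```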

### Step 10. Conditional Selection

In [12]:
arr = np.array([10, 15, 20, 25, 30])

selected = arr[arr > 20]
print(selected)

[25 30]
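
The selection above works because `arr > 20` produces a boolean mask. As a sketch, masks can be combined with `&` (and) and `|` (or), with each condition in parentheses, and `np.where` can replace values instead of filtering them:

```python
import numpy as np

arr = np.array([10, 15, 20, 25, 30])

mask = arr > 20
print(mask)   # [False False False  True  True]

between = arr[(arr >= 15) & (arr < 30)]   # both conditions must hold
outside = arr[(arr < 15) | (arr > 25)]    # either condition may hold
capped = np.where(arr > 20, 20, arr)      # replace values above 20 with 20

print(between)   # [15 20 25]
print(outside)   # [10 30]
print(capped)    # [10 15 20 20 20]
```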


### 🔥You did it! 10 exercises down — you're on fire! 🔥

## Pandas



### Step 1. Import the necessary libraries

In [13]:
import pandas as pd
import numpy as np

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

### Step 3. Assign it to a variable called users and use the 'user_id' as index

In [23]:
users = pd.read_csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user")
users

| | user_id\|age\|gender\|occupation\|zip_code |
| --- | --- |
| 0 | 1\|24\|M\|technician\|85711 |
| 1 | 2\|53\|F\|other\|94043 |
| 2 | 3\|23\|M\|writer\|32067 |
| 3 | 4\|24\|M\|technician\|43537 |
| 4 | 5\|33\|F\|other\|15213 |
| ... | ... |
| 938 | 939\|26\|F\|student\|33319 |
| 939 | 940\|32\|M\|administrator\|02215 |
| 940 | 941\|20\|M\|student\|97229 |
| 941 | 942\|48\|F\|librarian\|78209 |


In [32]:
# with separator
users = pd.read_csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user",sep='|')
users

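| | user_id | age | gender | occupation | zip_code |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |
| ... | ... | ... | ... | ... | ... |
| 938 | 939 | 26 | F | student | 33319 |
| 939 | 940 | 32 | M | administrator | 02215 |
| 940 | 941 | 20 | M | student | 97229 |
| 941 | 942 | 48 | F | librarian | 78209 |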
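
Step 3 asks for `user_id` as the index, while the cells above keep the default integer index. One way to do this is the `index_col` argument of `pd.read_csv`. The sketch below assumes a small in-memory sample (the first three rows shown above) instead of the URL, so it runs without network access. The name `users_by_id` is illustrative.

```python
import io
import pandas as pd

# In-memory stand-in for the remote file, using rows shown in the output above.
sample = io.StringIO(
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "3|23|M|writer|32067\n"
)

# index_col makes the named column the row index instead of a regular column.
users_by_id = pd.read_csv(sample, sep="|", index_col="user_id")

print(users_by_id.index.name)             # user_id
print(users_by_id.loc[2, "occupation"])   # other
```

With the real dataset, the same `sep` and `index_col` arguments can be passed together with the URL.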


### Step 4. See the first 25 entries

In [33]:
users.head(25)

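| | user_id | age | gender | occupation | zip_code |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |
| 5 | 6 | 42 | M | executive | 98101 |
| 6 | 7 | 57 | M | administrator | 91344 |
| 7 | 8 | 36 | M | administrator | 05201 |
| 8 | 9 | 29 | M | student | 01002 |
| 9 | 10 | 53 | M | lawyer | 90703 |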


### Step 5. See the last 10 entries

In [34]:
users.tail(10)

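| | user_id | age | gender | occupation | zip_code |
| --- | --- | --- | --- | --- | --- |
| 933 | 934 | 61 | M | engineer | 22902 |
| 934 | 935 | 42 | M | doctor | 66221 |
| 935 | 936 | 24 | M | other | 32789 |
| 936 | 937 | 48 | M | educator | 98072 |
| 937 | 938 | 38 | F | technician | 55038 |
| 938 | 939 | 26 | F | student | 33319 |
| 939 | 940 | 32 | M | administrator | 02215 |
| 940 | 941 | 20 | M | student | 97229 |
| 941 | 942 | 48 | F | librarian | 78209 |
| 942 | 943 | 22 | M | student | 77841 |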


### Step 6. What is the number of observations in the dataset?

In [35]:
users.count()

user_id       943
age           943
gender        943
occupation    943
zip_code      943
dtype: int64
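
`count()` reports the number of non-null values per column, which equals the number of observations only when nothing is missing. A minimal sketch of the difference, using a small hypothetical frame with one missing value:

```python
import pandas as pd

# Hypothetical frame: the third row has no age.
df = pd.DataFrame({"age": [24, 53, None], "gender": ["M", "F", "M"]})

n_rows = len(df)             # number of observations (rows)
n_rows_shape = df.shape[0]   # same value, taken from the shape tuple
non_null = df.count()        # non-null values per column

print(n_rows, n_rows_shape)  # 3 3
print(non_null)
```

For the `users` DataFrame every column has 943 non-null values, so both approaches agree there.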

### Step 7. What is the number of columns in the dataset?

In [36]:
# rows
print("Number of rows:",users.shape[0])
# columns
print("Number of columns:",users.shape[1])

Number of rows: 943
Number of columns: 5


### Step 8. Print the name of all the columns.

In [37]:
print("Names of all columns: ",users.columns)

Names of all columns:  Index(['user_id', 'age', 'gender', 'occupation', 'zip_code'], dtype='object')


### Step 9. How is the dataset indexed?

In [38]:
users.index

RangeIndex(start=0, stop=943, step=1)

### Step 10. What is the data type of each column?

In [39]:
users.dtypes

user_id        int64
age            int64
gender        object
occupation    object
zip_code      object
dtype: object
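
The output above reports `zip_code` as `object`, meaning the values are kept as strings. That matters for codes such as `02215`: if a column contains only digits, `read_csv` infers an integer type and the leading zero is lost. The sketch below shows this on a two-row in-memory sample and uses the `dtype` argument to force strings. The sample and variable names are illustrative.

```python
import io
import pandas as pd

raw = "user_id|zip_code\n940|02215\n941|97229\n"

# Default inference: an all-digit column becomes int64, dropping the leading zero.
inferred = pd.read_csv(io.StringIO(raw), sep="|")
print(inferred["zip_code"].tolist())   # [2215, 97229]

# Forcing str keeps the codes exactly as written in the file.
as_text = pd.read_csv(io.StringIO(raw), sep="|", dtype={"zip_code": str})
print(as_text["zip_code"].tolist())    # ['02215', '97229']
```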

### Step 11. Print only the occupation column

In [40]:
users['occupation']

0         technician
1              other
2             writer
3         technician
4              other
           ...      
938          student
939    administrator
940          student
941        librarian
942          student
Name: occupation, Length: 943, dtype: object

### Step 12. How many different occupations are in this dataset?

In [41]:
num_occupations = users["occupation"].nunique()
print("Number of different occupations:", num_occupations)

Number of different occupations: 21


### Step 13. What is the most frequent occupation?

In [42]:
# value_counts() returns a Series of unique occupations and their counts, sorted by count in descending order by default.

# idxmax() gives the index (occupation) with the maximum count.

most_frequent = users['occupation'].value_counts().idxmax()

print("Most frequent occupation:", most_frequent)

Most frequent occupation: student
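
As a sketch of how `value_counts()` and `idxmax()` fit together, the example below uses a short hypothetical Series instead of the full dataset and also shows `mode()`, which returns every value tied for the highest count:

```python
import pandas as pd

# Hypothetical sample of occupations.
occupations = pd.Series(["student", "other", "student", "writer", "other", "student"])

counts = occupations.value_counts()   # counts per value, largest first
top = counts.idxmax()                 # label with the highest count
modes = occupations.mode().tolist()   # all labels tied for the highest count

print(counts)
print(top)     # student
print(modes)   # ['student']
```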


### Step 14. Summarize the DataFrame.

In [48]:
users.describe()

| | user_id | age |
| --- | --- | --- |
| count | 943.0 | 943.0 |
| mean | 472.0 | 34.051962 |
| std | 272.364951 | 12.19274 |
| min | 1.0 | 7.0 |
| 25% | 236.5 | 25.0 |
| 50% | 472.0 | 31.0 |
| 75% | 707.5 | 43.0 |
| max | 943.0 | 73.0 |


### Step 15. Summarize all the columns

In [57]:

users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     943 non-null    int64 
 1   age         943 non-null    int64 
 2   gender      943 non-null    object
 3   occupation  943 non-null    object
 4   zip_code    943 non-null    object
dtypes: int64(2), object(3)
memory usage: 37.0+ KB
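
`info()` lists every column with its dtype and non-null count. If summary statistics for all columns are wanted instead, one option is `describe(include="all")`, which adds `unique`, `top`, and `freq` for non-numeric columns. The sketch below uses a small hypothetical frame, not the `users` data:

```python
import pandas as pd

# Hypothetical four-row sample with one numeric and one text column.
df = pd.DataFrame({
    "age": [24, 53, 23, 24],
    "occupation": ["technician", "other", "writer", "technician"],
})

# include="all" summarizes numeric and non-numeric columns in one table;
# statistics that do not apply to a column are reported as NaN.
summary = df.describe(include="all")
print(summary)
```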


### Step 16. Summarize only the occupation column

In [53]:
users['occupation'].describe()

count         943
unique         21
top       student
freq          196
Name: occupation, dtype: object

### Step 17. What is the mean age of users?

In [54]:
mean_age = users['age'].mean()
print("Mean age of users:", mean_age)

Mean age of users: 34.05196182396607


### Step 18. What is the age with least occurrence?

In [56]:
age_counts = users['age'].value_counts()

least_common_age = age_counts[age_counts == age_counts.min()].index.tolist()

print("Age(s) with least occurrence:", least_common_age)


Age(s) with least occurrence: [7, 11, 66, 10, 73]


### You're not just learning, you're mastering it. Keep aiming higher! 🚀