# Data Mining - Lab - 2

# NumPy & Data Exploration with Pandas

-------------------------------------------------------------------------------
## NumPy

1) NumPy (Numerical Python) is an open-source Python library for numerical and scientific computing.
2) It provides large, multi-dimensional arrays and matrices, along with a collection of mathematical functions that operate on them efficiently.
3) Its core is written in optimized C, which makes array operations much faster than equivalent loops over regular Python lists.
4) It underpins or interoperates with many other Python libraries in data science and machine learning, such as pandas, TensorFlow, and scikit-learn.
5) Features like broadcasting, vectorization, and integration with C/C++ code allow cleaner and faster numerical code.
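
Vectorization and broadcasting, mentioned in the list above, can be illustrated with a minimal sketch. The array values and variable names here are hypothetical and chosen only for illustration.

```python
import numpy as np

# Vectorization: one expression operates on every element, no Python loop.
prices = np.array([10.0, 20.0, 30.0])   # hypothetical sample values
discounted = prices * 0.9

# Broadcasting: the 1D array (shape (3,)) is applied to each row of the
# 2D array (shape (2, 3)) without repeating it by hand.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
offsets = np.array([10, 20, 30])
shifted = grid + offsets

print(discounted)
print(shifted)
```

Broadcasting works here because the trailing dimension of `grid` (3) matches the length of `offsets`.
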



### Step 1. Import the NumPy library

In [None]:
import numpy as np



### Step 2. Create a 1D array of numbers

In [None]:
arr = np.arange(11)
print(arr)
print(type(arr))

[ 0  1  2  3  4  5  6  7  8  9 10]
<class 'numpy.ndarray'>


In [None]:
arr = np.arange(2,20,2)
arr

array([ 2,  4,  6,  8, 10, 12, 14, 16, 18])

### Step 3. Reshape 1D to 2D Array

In [None]:
np.arange(2,22,2).reshape(2,5)

array([[ 2,  4,  6,  8, 10],
       [12, 14, 16, 18, 20]])
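
A reshape only succeeds when the new shape holds exactly as many elements as the original array. The sketch below, with hypothetical variable names, shows `-1` letting NumPy infer one dimension and what happens when the sizes do not match.

```python
import numpy as np

flat = np.arange(2, 22, 2)        # 10 elements: 2, 4, ..., 20
two_rows = flat.reshape(2, -1)    # -1 lets NumPy infer the 5 columns
print(two_rows.shape)

try:
    flat.reshape(3, 4)            # 12 slots cannot hold 10 elements
except ValueError as err:
    print("reshape failed:", err)
```
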

### Step 4. Create a Linspace array

In [None]:
np.linspace(1,3,5)

array([1. , 1.5, 2. , 2.5, 3. ])
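
`np.arange` and `np.linspace` differ in how they treat the end of the range: `arange` takes a step and excludes the stop value, while `linspace` takes a number of points and includes the stop value by default. A short comparison, using illustrative variable names:

```python
import numpy as np

half_open = np.arange(1, 3, 0.5)                  # step of 0.5, stop (3) excluded
closed = np.linspace(1, 3, 5)                     # 5 points, stop (3) included
open_end = np.linspace(1, 3, 4, endpoint=False)   # 4 points, stop excluded

print(half_open)
print(closed)
print(open_end)
```
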

### Step 5. Create a Random Numbered Array

In [None]:
np.random.rand(10)

array([0.32513051, 0.94605105, 0.72974458, 0.382629  , 0.32952675,
       0.77702468, 0.82697908, 0.99131081, 0.47693878, 0.4751378 ])

In [None]:
np.random.rand(2,4)

array([[0.44195197, 0.41047363, 0.42008562, 0.14183039],
       [0.89560966, 0.69863512, 0.90823267, 0.77124116]])

### Step 6. Create a Random Integer Array

In [None]:
np.random.randint(10, size=10)

array([3, 2, 5, 4, 6, 7, 1, 6, 2, 3])

In [None]:
np.random.randint(10,100, size=10)

array([45, 39, 80, 51, 93, 21, 84, 17, 67, 26])
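
The random arrays above change on every run. When reproducible results are needed, one option is NumPy's `Generator` API with a fixed seed. This is a sketch of an alternative to the `np.random.rand` and `np.random.randint` calls used in this lab, and the seed value 42 is arbitrary.

```python
import numpy as np

rng_a = np.random.default_rng(seed=42)   # 42 is an arbitrary seed
rng_b = np.random.default_rng(seed=42)   # same seed, same sequence

floats_a = rng_a.random(3)                 # uniform floats in [0, 1)
floats_b = rng_b.random(3)
ints_a = rng_a.integers(10, 100, size=5)   # integers in [10, 100), upper bound excluded
ints_b = rng_b.integers(10, 100, size=5)

print(np.array_equal(floats_a, floats_b))
print(np.array_equal(ints_a, ints_b))
```
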

### Step 7. Create a 1D Array and get Max, Min, ArgMax, ArgMin

In [None]:
arr = np.random.randint(10,100, size=10)
arr

array([81, 63, 93, 41, 25, 11, 23, 63, 14, 83])

In [None]:
np.min(arr)

np.int64(11)

In [None]:
np.max(arr)

np.int64(93)

In [None]:
np.argmin(arr)

np.int64(5)

In [None]:
np.argmax(arr)

np.int64(2)
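
`argmin` and `argmax` return positions, not values, and all four functions accept an `axis` argument on multi-dimensional arrays. The sketch below reuses the array printed in this step; reshaping it to 2 x 5 is an assumption made only to demonstrate `axis`.

```python
import numpy as np

arr = np.array([81, 63, 93, 41, 25, 11, 23, 63, 14, 83])

# The position returned by argmin points at the minimum value.
smallest = arr[np.argmin(arr)]

grid = arr.reshape(2, 5)
col_max = grid.max(axis=0)        # maximum of each column
row_argmax = grid.argmax(axis=1)  # position of the maximum within each row

# Without axis, argmax indexes the flattened array; unravel_index
# converts that flat position back to (row, column).
row, col = np.unravel_index(grid.argmax(), grid.shape)

print(smallest, col_max, row_argmax, (row, col))
```
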

### Step 8. Indexing in 1D Array

In [33]:
arr[5]

np.int64(11)

In [34]:
arr[2:7]

array([93, 41, 25, 11, 23])

### Step 9. Indexing in 2D Array

In [38]:
arr = np.random.randint(10,100, size=10).reshape(2,5)
arr

array([[77, 80, 81, 76, 35],
       [88, 91, 51, 94, 47]])

In [44]:
arr[0]

array([77, 80, 81, 76, 35])

In [45]:
arr[0,1]

np.int64(80)

In [46]:
arr[0][2]

np.int64(81)
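
`arr[0, 2]` and `arr[0][2]` return the same element, but the comma form indexes both axes in one step and also supports slicing along columns. A short sketch using the array printed above:

```python
import numpy as np

arr = np.array([[77, 80, 81, 76, 35],
                [88, 91, 51, 94, 47]])

same = arr[0, 2] == arr[0][2]        # both select row 0, column 2
second_col = arr[:, 1]               # every row, column 1
block = arr[:, 1:3]                  # every row, columns 1 and 2
last_row_reversed = arr[-1, ::-1]    # last row, right to left

print(same, second_col, block, last_row_reversed, sep="\n")
```
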

### Step 10. Conditional Selection

In [85]:
arr = np.random.randint(10,100, size=10)
print(arr)
print(arr[arr > 50])

[19 41 94 46 18 45 92 77 85 13]
[94 92 77 85]


In [86]:
arr = np.random.randint(10,100, size=10).reshape(2,5)
print(arr)
print(arr[arr > 50])

[[51 92 72 20 25]
 [38 61 42 19 59]]
[51 92 72 61 59]
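
Conditional selection works because a comparison such as `arr > 50` produces a boolean array that is then used as an index. Note that selecting from a 2D array this way returns a flat 1D result, as the second output above shows. The sketch below reuses the first array from this step and adds a combined condition and `np.where`; the thresholds are arbitrary.

```python
import numpy as np

arr = np.array([19, 41, 94, 46, 18, 45, 92, 77, 85, 13])

mask = arr > 50                          # boolean array, one entry per element
count_over_50 = mask.sum()               # True counts as 1

# Combine conditions with & (and) or | (or); the parentheses are required.
between = arr[(arr > 40) & (arr < 90)]

# np.where picks from two alternatives element by element.
capped = np.where(arr > 50, 50, arr)

print(count_over_50, between, capped, sep="\n")
```
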


### 🔥You did it! 10 exercises down — you're on fire! 🔥

## Pandas



### Step 1. Import the necessary libraries

In [47]:
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user).

### Step 3. Assign it to a variable called data and use the 'user_id' as index

In [114]:
data = pd.read_csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user",sep="|").set_index('user_id')
data

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
| ... | ... | ... | ... | ... |
| 939 | 26 | F | student | 33319 |
| 940 | 32 | M | administrator | 02215 |
| 941 | 20 | M | student | 97229 |
| 942 | 48 | F | librarian | 78209 |
| 943 | 22 | M | student | 77841 |

943 rows × 4 columns
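
The same load can be written with `index_col` instead of a separate `set_index` call. The sketch below is self-contained: it reads a three-row inline sample (rows copied from the output above) rather than the real URL, and the name `sample_users` is hypothetical. Forcing `zip_code` to `str` is needed for this small sample, because otherwise pandas would parse the column as integers and drop the leading zero.

```python
import io
import pandas as pd

# Inline stand-in for the pipe-delimited u.user file (assumption: same layout).
sample = io.StringIO(
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "940|32|M|administrator|02215\n"
)

sample_users = pd.read_csv(
    sample,
    sep="|",
    index_col="user_id",          # same effect as .set_index('user_id')
    dtype={"zip_code": str},      # keep ZIP codes as text
)

print(sample_users)
print(sample_users.loc[940, "zip_code"])
```
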


### Step 4. See the first 25 entries

In [115]:
data.head(25)

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
| 6 | 42 | M | executive | 98101 |
| 7 | 57 | M | administrator | 91344 |
| 8 | 36 | M | administrator | 05201 |
| 9 | 29 | M | student | 01002 |
| 10 | 53 | M | lawyer | 90703 |

(Output truncated: only the first 10 of the 25 rows are shown.)


### Step 5. See the last 10 entries

In [116]:
data.tail(10)

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 934 | 61 | M | engineer | 22902 |
| 935 | 42 | M | doctor | 66221 |
| 936 | 24 | M | other | 32789 |
| 937 | 48 | M | educator | 98072 |
| 938 | 38 | F | technician | 55038 |
| 939 | 26 | F | student | 33319 |
| 940 | 32 | M | administrator | 02215 |
| 941 | 20 | M | student | 97229 |
| 942 | 48 | F | librarian | 78209 |
| 943 | 22 | M | student | 77841 |


### Step 6. What is the number of observations in the dataset?

In [117]:
data.shape[0]

943

### Step 7. What is the number of columns in the dataset?

In [118]:
data.shape[1]

4

### Step 8. Print the name of all the columns.

In [119]:
data.columns

Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')

### Step 9. How is the dataset indexed?

In [96]:
data.index

Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
       ...
       934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
      dtype='int64', name='user_id', length=943)

### Step 10. What is the data type of each column?

In [120]:
data.dtypes

| column | dtype |
|---|---|
| age | int64 |
| gender | object |
| occupation | object |
| zip_code | object |


### Step 11. Print only the occupation column

In [121]:
data["occupation"]

| user_id | occupation |
|---|---|
| 1 | technician |
| 2 | other |
| 3 | writer |
| 4 | technician |
| 5 | other |
| ... | ... |
| 939 | student |
| 940 | administrator |
| 941 | student |
| 942 | librarian |
| 943 | student |


### Step 12. How many different occupations are in this dataset?

In [125]:
len(data['occupation'].unique())

21
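
`Series.nunique()` is a shorter way to count distinct values. It differs from `len(series.unique())` only when values are missing: `unique()` includes the missing value, while `nunique()` skips it by default. The users dataset has no missing occupations, so both give the same count there. The sketch below uses a small hypothetical Series with one missing entry to show the difference.

```python
import pandas as pd

# Hypothetical sample with one missing value.
occupations = pd.Series(["technician", "other", "writer", "technician", "other", None])

via_unique = len(occupations.unique())              # counts the missing value too
via_nunique = occupations.nunique()                 # ignores missing values
with_missing = occupations.nunique(dropna=False)    # opt in to counting them

print(via_unique, via_nunique, with_missing)
```
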

### Step 13. What is the most frequent occupation?

In [111]:
data['occupation'].value_counts().idxmax()

'student'
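
A frequency table answers "most frequent" questions directly: `value_counts()` counts each distinct value and sorts the counts in descending order. The sketch below runs on a small hypothetical Series, not the real dataset.

```python
import pandas as pd

# Hypothetical sample values.
occupations = pd.Series(["student", "other", "student", "writer", "other", "student"])

counts = occupations.value_counts()   # sorted from most to least frequent
most_frequent = counts.idxmax()       # label with the highest count
top_count = counts.iloc[0]            # the highest count itself

print(counts)
print(most_frequent, top_count)
```

`occupations.mode()` is an alternative that returns every value tied for first place.
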


### Step 14. Summarize the DataFrame.

In [None]:
data.describe()

|  | age |
|---|---|
| count | 943.0 |
| mean | 34.051962 |
| std | 12.19274 |
| min | 7.0 |
| 25% | 25.0 |
| 50% | 31.0 |
| 75% | 43.0 |
| max | 73.0 |


### Step 15. Summarize all the columns

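In [105]:
data.describe(include='all')

|  | age | gender | occupation | zip_code |
|---|---|---|---|---|
| count | 943.0 | 943 | 943 | 943 |
| unique | NaN | 2 | 21 | 795 |
| top | NaN | M | student | 55414 |
| freq | NaN | 670 | 196 | 9 |
| mean | 34.051962 | NaN | NaN | NaN |
| std | 12.19274 | NaN | NaN | NaN |
| min | 7.0 | NaN | NaN | NaN |
| 25% | 25.0 | NaN | NaN | NaN |
| 50% | 31.0 | NaN | NaN | NaN |
| 75% | 43.0 | NaN | NaN | NaN |
| max | 73.0 | NaN | NaN | NaN |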
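
By default `describe()` summarizes only numeric columns; passing `include='all'` adds the text columns, which get `unique`, `top`, and `freq` instead of the numeric statistics. The sketch below shows the difference on a five-row frame whose values are copied from the first rows of the dataset; the frame name `sample` is hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({
    "age": [24, 53, 23, 24, 33],
    "gender": ["M", "F", "M", "M", "F"],
    "occupation": ["technician", "other", "writer", "technician", "other"],
})

numeric_only = sample.describe()               # numeric columns only
everything = sample.describe(include="all")    # numeric and text columns

print(list(numeric_only.columns))
print(everything)
```
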


### Step 16. Summarize only the occupation column

In [124]:
data['occupation'].unique()

array(['technician', 'other', 'writer', 'executive', 'administrator',
       'student', 'lawyer', 'educator', 'scientist', 'entertainment',
       'programmer', 'librarian', 'homemaker', 'artist', 'engineer',
       'marketing', 'none', 'healthcare', 'retired', 'salesman', 'doctor'],
      dtype=object)
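
`unique()` lists the distinct values; for a compact summary of a text column, `Series.describe()` reports the count, the number of distinct values, the most frequent value, and its frequency. A minimal sketch on four hypothetical rows:

```python
import pandas as pd

# Hypothetical four-row sample of the occupation column.
occupation = pd.Series(
    ["technician", "other", "writer", "technician"],
    name="occupation",
)

summary = occupation.describe()
print(summary)
```
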

### Step 17. What is the mean age of users?

In [107]:
data['age'].mean()

np.float64(34.05196182396607)

### Step 18. What is the age with least occurrence?

In [None]:
data['age'].value_counts().tail()

age
7     1
66    1
11    1
10    1
73    1
Name: count, dtype: int64
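
As the output shows, several ages can tie for the lowest count, so the tail of the frequency table may not list all of them. One way to collect every tied value is to filter the counts by their minimum. The sketch below uses a hypothetical list of ages.

```python
import pandas as pd

# Hypothetical sample ages.
ages = pd.Series([24, 53, 23, 24, 33, 24, 33], name="age")

counts = ages.value_counts()
rarest = counts[counts == counts.min()]   # every age tied for the lowest count

print(sorted(rarest.index.tolist()))
```
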

### You're not just learning, you're mastering it. Keep aiming higher! 🚀