# Ex3 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [8]:
import pandas as pd
import numpy as np

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

The dataset is hosted at the URL https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user. Although it is a remote file rather than a local one, pd.read_csv can read it directly from the URL inside a Jupyter notebook.



In [9]:
# Read the CSV file from the URL

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'

In [10]:
users = pd.read_csv(url, sep='|')

#### Notes :
- Using '|' as the Separator: When sep='|' is specified, it indicates that the values in the CSV file are separated by the pipe character (|). This is often used when dealing with CSV files where commas might be part of the data itself, or when the data is specifically formatted with pipes as delimiters instead of commas.
- sep Parameter: This parameter stands for "separator" and is used to specify the character or characters that separate values in the CSV file. By default, pd.read_csv() assumes the separator is a comma (,), which is why you don't usually need to specify sep if your file is comma-separated.
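
To make the effect of sep concrete, here is a minimal sketch that parses the same pipe-delimited text twice, once with the default separator and once with sep='|'. The two data rows are copied from the output shown in this notebook; the names raw, wrong and right are illustrative only.

```python
import io
import pandas as pd

# Two-row sample in the same pipe-delimited layout as u.user.
raw = (
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
)

# Default separator (comma): each line is read as one single column.
wrong = pd.read_csv(io.StringIO(raw))

# sep='|': the five fields are split into separate columns.
right = pd.read_csv(io.StringIO(raw), sep="|")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 5)
```

If a freshly loaded DataFrame has a single column whose name contains the delimiter, a wrong separator is a likely cause.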

#### Display the DataFrame before changing the index

In [11]:
print("Original DataFrame ")
print(users)

Original DataFrame 
     user_id  age gender     occupation zip_code
0          1   24      M     technician    85711
1          2   53      F          other    94043
2          3   23      M         writer    32067
3          4   24      M     technician    43537
4          5   33      F          other    15213
..       ...  ...    ...            ...      ...
938      939   26      F        student    33319
939      940   32      M  administrator    02215
940      941   20      M        student    97229
941      942   48      F      librarian    78209
942      943   22      M        student    77841

[943 rows x 5 columns]


### Step 3. Use the 'user_id' as index


This step asks us to replace the default integer index (0, 1, 2, ...) with the 'user_id' column. The method for this is set_index.

users.set_index('user_id', inplace=True) makes the 'user_id' column the index of the DataFrame. With inplace=True the DataFrame is modified in place, so the result does not have to be assigned back to users.

In [12]:
users.set_index('user_id', inplace=True)
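
As a hedged aside, the same result can probably be reached without inplace=True. The sketch below shows two alternatives on a two-row sample copied from the output above: assigning the return value of set_index, and passing index_col to read_csv. The names raw, sample, indexed and indexed_on_load are illustrative, not part of the exercise.

```python
import io
import pandas as pd

raw = (
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
)

# Option 1: with the default inplace=False, set_index returns a new DataFrame.
sample = pd.read_csv(io.StringIO(raw), sep="|")
indexed = sample.set_index("user_id")

# Option 2: let read_csv build the index while loading.
indexed_on_load = pd.read_csv(io.StringIO(raw), sep="|", index_col="user_id")

print(indexed.index.name)          # user_id
print(indexed_on_load.index.name)  # user_id
print("user_id" in sample.columns)  # True: the original frame is unchanged
```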

#### Displaying the DataFrame after setting user_id as index:

In [13]:
print("DataFrame with 'user_id' as Index: ")
print(users)

DataFrame with 'user_id' as Index: 
         age gender     occupation zip_code
user_id                                    
1         24      M     technician    85711
2         53      F          other    94043
3         23      M         writer    32067
4         24      M     technician    43537
5         33      F          other    15213
...      ...    ...            ...      ...
939       26      F        student    33319
940       32      M  administrator    02215
941       20      M        student    97229
942       48      F      librarian    78209
943       22      M        student    77841

[943 rows x 4 columns]


### Step 4. See the first 25 entries

In [14]:
users.head(25)

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
| 6 | 42 | M | executive | 98101 |
| 7 | 57 | M | administrator | 91344 |
| 8 | 36 | M | administrator | 05201 |
| 9 | 29 | M | student | 01002 |
| 10 | 53 | M | lawyer | 90703 |


### Step 5. See the last 10 entries

In [15]:
users.tail(10)

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 934 | 61 | M | engineer | 22902 |
| 935 | 42 | M | doctor | 66221 |
| 936 | 24 | M | other | 32789 |
| 937 | 48 | M | educator | 98072 |
| 938 | 38 | F | technician | 55038 |
| 939 | 26 | F | student | 33319 |
| 940 | 32 | M | administrator | 02215 |
| 941 | 20 | M | student | 97229 |
| 942 | 48 | F | librarian | 78209 |
| 943 | 22 | M | student | 77841 |


### Step 6. What is the number of observations in the dataset?

In [16]:
users.info()

<class 'pandas.core.frame.DataFrame'>
Index: 943 entries, 1 to 943
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   age         943 non-null    int64 
 1   gender      943 non-null    object
 2   occupation  943 non-null    object
 3   zip_code    943 non-null    object
dtypes: int64(1), object(3)
memory usage: 36.8+ KB


### Observations :
1. The DataFrame's index has 943 entries, running from 1 to 943.
2. It contains 4 columns: "age", "gender", "occupation" and "zip_code".
3. There are no missing values: every column reports 943 non-null entries.
4. Three columns have dtype object and one ("age") has dtype int64.

Therefore, the number of observations in this dataset is 943. This means there are 943 rows or records in the DataFrame, each corresponding to a unique individual (assuming each row represents a unique user in this context).


Alternatively, we could use the shape attribute (it is an attribute, not a method, so it takes no parentheses). It returns only the number of rows and columns, as a tuple:

In [18]:
users.shape

(943, 4)
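
For the row count alone, a few equivalent spellings exist. The sketch below uses a three-row sample built from values shown earlier; the name sample is illustrative only.

```python
import pandas as pd

sample = pd.DataFrame(
    {"age": [24, 53, 23], "gender": ["M", "F", "M"]},
    index=pd.Index([1, 2, 3], name="user_id"),
)

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = sample.shape

print(n_rows)             # 3
print(len(sample))        # 3, len() of a DataFrame is its row count
print(len(sample.index))  # 3
```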

### Step 7. What is the number of columns in the dataset?

In [23]:

col_count = users.shape[1] # 1 is the column axis 
print("Column count is: ", col_count)

Column count is:  4


### Step 8. Print the name of all the columns.

In [24]:
print(users.columns)

Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')


Alternatively, iterating over the DataFrame yields its column names:

In [26]:
for column in users:
    print(column)

age
gender
occupation
zip_code


### Step 9. How is the dataset indexed?

In [27]:
users.index

Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
       ...
       934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
      dtype='int64', name='user_id', length=943)

The dataset is indexed by "user_id", with values running from 1 to 943.

### Step 10. What is the data type of each column?

In [29]:

print(users.dtypes)

age            int64
gender        object
occupation    object
zip_code      object
dtype: object
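
The object dtype of zip_code matters because zip codes such as 02215 start with a zero. As a rough sketch of what could go wrong, the example below reads a purely numeric zip column with and without an explicit dtype. The two rows are taken from the output above; the assumption that the real column stays object because some of its values are not purely numeric is mine and is not verified here.

```python
import io
import pandas as pd

raw = "user_id|zip_code\n940|02215\n941|97229\n"

# Without a dtype hint, an all-digit column is parsed as integers
# and the leading zero is lost.
as_numbers = pd.read_csv(io.StringIO(raw), sep="|")

# Forcing str keeps the zip codes exactly as written.
as_text = pd.read_csv(io.StringIO(raw), sep="|", dtype={"zip_code": str})

print(as_numbers["zip_code"].tolist())  # [2215, 97229]
print(as_text["zip_code"].tolist())     # ['02215', '97229']
```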


### Step 11. Print only the occupation column

In [30]:
print(users['occupation'])

user_id
1         technician
2              other
3             writer
4         technician
5              other
           ...      
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object


### Step 12. How many different occupations are in this dataset?

In [36]:
print("The number of different occupations in this DataFrame is ", len(users.unique()))

AttributeError: 'DataFrame' object has no attribute 'unique'

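The error occurs because unique() is a Series method, not a DataFrame method, so it has to be called on a single column. Before doing that, let's check whether the "occupation" column needs to be converted to string first: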

In [35]:
string_occupation = users['occupation'].astype(str)

In [33]:
string_occupation.dtype

dtype('O')

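Oops, still object! In this environment pandas reports string columns with dtype object, so astype(str) changes nothing here.

Another check: does the "occupation" column need cleaning of leading or trailing whitespace?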

In [34]:
users_clean = users['occupation'].str.strip()

Even though users.info() reports no missing values (NaN), it is good practice to check the column of interest explicitly. We can do this with isnull() and any():

In [37]:
print(users['occupation'].isnull().any())


False


Confirmed: there are no NaN values.

None of these checks was the cause of the error. unique() is a Series method, so it works once it is called on the "occupation" column:

In [41]:
unique_occupations = users['occupation'].unique()
print("The number of unique occupations is: ", len(unique_occupations))


The number of unique occupations is:  21
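
A compact way to see the difference between the failing and the working call is sketched below on four rows copied from the start of the dataset. It also shows nunique(), which returns the count directly. The name sample is illustrative only.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "age": [24, 53, 23, 24],
        "occupation": ["technician", "other", "writer", "technician"],
    }
)

# A DataFrame has no unique() attribute; a single column (a Series) does.
print(hasattr(sample, "unique"))               # False
print(hasattr(sample["occupation"], "unique"))  # True

# unique() returns the distinct values, nunique() returns how many there are.
print(list(sample["occupation"].unique()))  # ['technician', 'other', 'writer']
print(sample["occupation"].nunique())       # 3
```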


### Step 13. What is the most frequent occupation?
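
This step has no solution cell yet. One possible approach, sketched on the occupations of the first four users only, is to count each value with value_counts() and take the label with the highest count. The name occupation is illustrative; on the full data the same calls would be applied to users['occupation'], and that result is not shown here.

```python
import pandas as pd

# Occupations of users 1 to 4, copied from the output above.
occupation = pd.Series(
    ["technician", "other", "writer", "technician"], name="occupation"
)

# value_counts() counts each distinct value, sorted from most to least frequent.
counts = occupation.value_counts()

print(counts.idxmax())  # technician
print(counts.max())     # 2
```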

### Step 14. Summarize the DataFrame.
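
This step has no solution cell yet. A minimal sketch, assuming describe() is the intended tool: by default it summarizes only the numeric columns, which in this dataset means "age". The four sample rows are copied from the start of the dataset and the name sample is illustrative.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "age": [24, 53, 23, 24],
        "gender": ["M", "F", "M", "M"],
        "occupation": ["technician", "other", "writer", "technician"],
    }
)

# Default describe(): count, mean, std, min, quartiles and max
# for numeric columns only.
summary = sample.describe()

print(list(summary.columns))       # ['age']
print(summary.loc["mean", "age"])  # 31.0
```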

### Step 15. Summarize all the columns
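
This step has no solution cell yet. One way to cover every column, sketched here on four sample rows copied from the start of the dataset, is describe(include='all'). It adds the rows unique, top and freq for non-numeric columns and leaves NaN where a statistic does not apply. The name sample is illustrative.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "age": [24, 53, 23, 24],
        "gender": ["M", "F", "M", "M"],
        "occupation": ["technician", "other", "writer", "technician"],
    }
)

# include="all" summarizes numeric and non-numeric columns together.
summary_all = sample.describe(include="all")

print(list(summary_all.columns))             # ['age', 'gender', 'occupation']
print(summary_all.loc["top", "occupation"])  # technician
print(summary_all.loc["unique", "gender"])   # 2
```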

### Step 16. Summarize only the occupation column
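
This step has no solution cell yet. A possible sketch: calling describe() on a single text column returns its count, number of distinct values, most frequent value and that value's frequency. The four values are the occupations of users 1 to 4; on the full data the call would be users['occupation'].describe().

```python
import pandas as pd

occupation = pd.Series(
    ["technician", "other", "writer", "technician"], name="occupation"
)

# For a text column, describe() reports count, unique, top and freq.
occupation_summary = occupation.describe()

print(occupation_summary["count"])   # 4
print(occupation_summary["unique"])  # 3
print(occupation_summary["top"])     # technician
print(occupation_summary["freq"])    # 2
```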

### Step 17. What is the mean age of users?
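
This step has no solution cell yet. A minimal sketch using the ages of the first ten users shown in Step 4; the value printed is the mean of this sample only, not of the full dataset. On the full data the call would be users['age'].mean().

```python
import pandas as pd

# Ages of users 1 to 10, copied from the Step 4 output.
ages = pd.Series([24, 53, 23, 24, 33, 42, 57, 36, 29, 53], name="age")

mean_age = ages.mean()
print(mean_age)  # 37.4
```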

### Step 18. What is the age with least occurrence?
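
This step has no solution cell yet. One hedged approach: count how often each age occurs, then keep every age whose count equals the minimum, since several ages may tie for least frequent. The sample holds the ages of the first ten users from Step 4, and the names counts and least_common are illustrative.

```python
import pandas as pd

# Ages of users 1 to 10, copied from the Step 4 output.
ages = pd.Series([24, 53, 23, 24, 33, 42, 57, 36, 29, 53], name="age")

# Frequency of each age, then all ages that share the smallest frequency.
counts = ages.value_counts()
least_common = counts[counts == counts.min()]

print(sorted(least_common.index.tolist()))  # [23, 29, 33, 36, 42, 57]
print(counts.min())                         # 1
```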