
### Getting and Knowing your Data

This exercise involves pulling data directly from the internet and performing various operations to get to know the dataset better. We will be using the pandas library for data manipulation.

## Import the necessary libraries and import the dataset

In [2]:
import pandas as pd
users = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', 
                      sep='|', index_col='user_id')
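
The same call can be tried offline on a small in-memory sample. The sketch below assumes three made-up rows in the same pipe-delimited layout (the `sample` text and the `sample_users` name are illustrative, not part of the exercise) and passes `dtype={'zip_code': str}` so that leading zeros in ZIP codes are kept as text.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same pipe-delimited layout as u.user.
sample = io.StringIO(
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "8|36|M|administrator|05201\n"
    "9|29|M|student|01002\n"
)

# sep='|' sets the field delimiter; index_col uses user_id as the row labels.
# dtype={'zip_code': str} keeps ZIP codes as text, so leading zeros survive.
sample_users = pd.read_csv(sample, sep='|', index_col='user_id',
                           dtype={'zip_code': str})

print(sample_users)
```

Without the `dtype` argument, a column made up only of digits would be parsed as integers and `05201` would become `5201`.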

## View the first 10 entries
The head method returns the first rows of the dataset (5 by default). Passing 10 shows the first 10 entries, which gives an initial understanding of the data.

In [3]:
users.head(10)

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213
6,42,M,executive,98101
7,57,M,administrator,91344
8,36,M,administrator,5201
9,29,M,student,1002
10,53,M,lawyer,90703


## View the last 5 entries
The tail method lets us see the last 5 entries in the dataset.

In [4]:
users.tail(5)

user_id,age,gender,occupation,zip_code
939,26,F,student,33319
940,32,M,administrator,2215
941,20,M,student,97229
942,48,F,librarian,78209
943,22,M,student,77841


## Number of observations in the dataset
The shape attribute gives the dimensions of the DataFrame as a (rows, columns) tuple. The first element (index 0) is the number of rows (observations).

In [5]:
users.shape[0]


943

## Number of columns in the dataset
The second element (1st index) of the shape attribute represents the number of columns.



In [6]:
users.shape[1]

4
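
Because shape is an ordinary tuple, both dimensions can be unpacked in one step. A minimal sketch on a made-up frame (the `toy` DataFrame is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical toy frame; any DataFrame behaves the same way.
toy = pd.DataFrame({'age': [24, 53, 23], 'gender': ['M', 'F', 'M']})

# shape is a plain (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# len() offers an equivalent way to get each dimension.
print(len(toy), len(toy.columns))
```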

## Print the name of all the columns
The columns attribute provides the names of all the columns in the dataset.



In [7]:
users.columns

Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')

## How is the dataset indexed?
The index attribute returns the index (or labels) of the DataFrame.

In [8]:
users.index

Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
       ...
       934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
      dtype='int64', name='user_id', length=943)

## Data type of each column
The dtypes attribute gives the data type of each column in the DataFrame.

In [9]:
users.dtypes

age            int64
gender        object
occupation    object
zip_code      object
dtype: object

## Print only the occupation column

We can select a single column from the DataFrame either with dot notation or with square brackets. Both return a Series containing the data of the 'occupation' column.

In [10]:
users.occupation

user_id
1         technician
2              other
3             writer
4         technician
5              other
           ...      
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object

In [11]:
users['occupation']

user_id
1         technician
2              other
3             writer
4         technician
5              other
           ...      
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object
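
The two forms are interchangeable for a name like occupation, but not for every column name. The sketch below uses a made-up frame (`toy` and its column names are assumptions) to show two cases where only square brackets reach the column: a name containing a space, and a name that collides with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical frame with two awkward column names.
toy = pd.DataFrame({
    'occupation': ['student', 'writer'],
    'zip code': ['85711', '94043'],   # contains a space
    'count': [1, 2],                  # same name as DataFrame.count
})

# For a simple name, dot and bracket access give the same Series.
same = toy.occupation.equals(toy['occupation'])

# Brackets work for any column name.
zip_series = toy['zip code']
count_series = toy['count']

# Dot access resolves to the method here, not the column.
is_method = callable(toy.count)

print(same, is_method)
```

For that reason, bracket access is the safer default in reusable code, while dot access is convenient for interactive exploration.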

## Unique occupations and the number of unique occupations
The unique method returns the distinct occupations. The nunique method returns how many distinct occupations there are.

In [12]:
users.occupation.unique()

array(['technician', 'other', 'writer', 'executive', 'administrator',
       'student', 'lawyer', 'educator', 'scientist', 'entertainment',
       'programmer', 'librarian', 'homemaker', 'artist', 'engineer',
       'marketing', 'none', 'healthcare', 'retired', 'salesman', 'doctor'],
      dtype=object)

In [13]:
users.occupation.nunique()

21
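
The two methods treat missing values differently, which can matter on messier data. A small sketch on a made-up Series (the `jobs` data is an assumption, not taken from the dataset):

```python
import pandas as pd

# Hypothetical Series with repeats and one missing value.
jobs = pd.Series(['student', 'writer', 'student', None, 'writer'])

# unique() lists values in order of first appearance, including the missing one.
distinct = jobs.unique()

# nunique() ignores missing values unless dropna=False is passed.
n_distinct = jobs.nunique()
n_with_na = jobs.nunique(dropna=False)

print(distinct, n_distinct, n_with_na)
```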

## Most frequent occupation
The value_counts method returns a Series containing the count of each unique value, sorted in descending order. We select the first entry to find the most frequent occupation.
The head method controls how many rows are returned, and indexing into the index attribute picks a label by position (0 for the first, 1 for the second, and so on).

In [14]:
users.occupation.value_counts().head(1).index[0]

'student'

In [15]:
users.occupation.value_counts().head(5)

occupation
student          196
other            105
educator          95
administrator     79
engineer          67
Name: count, dtype: int64
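
There are other ways to reach the same answer. The sketch below, on a made-up Series (`jobs` is an assumption for illustration), compares the head-and-index approach with `idxmax` on the counts and with `mode` on the raw values.

```python
import pandas as pd

# Hypothetical occupations; 'student' appears most often.
jobs = pd.Series(['student', 'other', 'student', 'educator', 'student', 'other'])

counts = jobs.value_counts()

top_via_head = counts.head(1).index[0]   # first label of the sorted counts
top_via_idxmax = counts.idxmax()         # label of the largest count
top_via_mode = jobs.mode()[0]            # most frequent value in the raw data

print(top_via_head, top_via_idxmax, top_via_mode)
```

All three agree here. When several values tie for first place, `mode` returns all of them, while the other two return a single label.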

## Summarize the DataFrame
The describe method provides a summary of the numeric columns in the DataFrame.

In [16]:
users.describe()

,age
count,943.0
mean,34.051962
std,12.19274
min,7.0
25%,25.0
50%,31.0
75%,43.0
max,73.0


## Summarize all the columns
Including the parameter include='all' allows the describe method to return summary statistics for all columns, not just the numeric ones.


In [17]:
users.describe(include="all")

,age,gender,occupation,zip_code
count,943.0,943,943,943
unique,,2,21,795
top,,M,student,55414
freq,,670,196,9
mean,34.051962,,,
std,12.19274,,,
min,7.0,,,
25%,25.0,,,
50%,31.0,,,
75%,43.0,,,
max,73.0,,,
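
The result of describe is itself a DataFrame, so individual statistics can be looked up with `.loc`. A minimal sketch on a made-up frame (`toy` and `summary` are illustrative names) that also shows why some cells are empty: numeric statistics do not apply to text columns, and vice versa.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
toy = pd.DataFrame({'age': [20, 30, 40], 'gender': ['M', 'F', 'M']})

summary = toy.describe(include='all')

mean_age = summary.loc['mean', 'age']            # numeric statistic
top_gender = summary.loc['top', 'gender']        # most frequent text value
n_genders = summary.loc['unique', 'gender']      # number of distinct values

# Statistics that do not apply to a column are filled with NaN.
mean_gender_missing = pd.isna(summary.loc['mean', 'gender'])

print(mean_age, top_gender, n_genders, mean_gender_missing)
```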


## Summarize the occupation column
The describe method can also be applied to a single column to get its summary statistics.




In [18]:
users.occupation.describe()

count         943
unique         21
top       student
freq          196
Name: occupation, dtype: object

## Mean age of users
The mean method calculates the average age of the users, and round is used to round it to the nearest integer.


In [19]:
round(users.age.mean())

34
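
Python's built-in round has two behaviours worth knowing: an optional second argument sets the number of decimal places, and exact halves are rounded to the nearest even integer. A small sketch with made-up ages (the `ages` Series is an assumption):

```python
import pandas as pd

# Hypothetical ages whose mean is exactly 31.25.
ages = pd.Series([24, 53, 23, 25])
mean_age = float(ages.mean())

nearest = round(mean_age)        # no second argument: returns an int
two_dp = round(mean_age, 2)      # keep two decimal places

# Exact halves go to the nearest even integer ("banker's rounding").
half_even = (round(0.5), round(1.5), round(2.5))

print(nearest, two_dp, half_even)
```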

## Minimum of all columns in dataset
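Calling the min method without arguments returns the minimum of every column that supports comparison. For text columns such as occupation and zip_code, the minimum is the value that sorts first alphabetically.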

In [20]:
users.min()

age                       7
gender                    F
occupation    administrator
zip_code              00000
dtype: object
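
Each minimum is computed per column, so the values in the result do not come from a single row. The sketch below uses a made-up frame (`toy` is an assumption) and also shows `numeric_only=True`, which restricts the result to numeric columns.

```python
import pandas as pd

# Hypothetical frame mixing numeric and text columns.
toy = pd.DataFrame({
    'age': [24, 7, 53],
    'occupation': ['writer', 'student', 'administrator'],
    'zip_code': ['85711', '00000', '94043'],
})

# Column-wise minimum: numeric order for age, alphabetical order for text.
col_min = toy.min()

# Restrict the result to numeric columns only.
numeric_min = toy.min(numeric_only=True)

print(col_min)
print(numeric_min)
```

Here the smallest age comes from the second row while the smallest occupation comes from the third, which shows that the columns are handled independently.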

## Ages with the most and the fewest occurrences
The value_counts method sorts the counts in descending order, so head returns the most frequent ages and tail returns the least frequent ones. For example, head(5) returns the five most common ages.

In [21]:
users.age.value_counts().head(10)

age
30    39
25    38
22    37
28    36
27    35
26    34
24    33
29    32
20    32
23    28
Name: count, dtype: int64

In [22]:
users.age.value_counts().tail(5)

age
7     1
66    1
11    1
10    1
73    1
Name: count, dtype: int64
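
When many values share the lowest count, tail(5) shows only some of them. A hedged sketch on made-up ages (the `ages` Series is an assumption) that collects every value tied for the minimum count instead:

```python
import pandas as pd

# Hypothetical ages: 30 is the most common, 22 and 73 each appear once.
ages = pd.Series([30, 25, 30, 22, 25, 30, 73])

counts = ages.value_counts()

most_common = counts.idxmax()     # age with the highest count
least_count = counts.min()        # the smallest count observed

# All ages tied for the smallest count, not just the last few rows.
rarest = sorted(counts[counts == least_count].index.tolist())

print(most_common, least_count, rarest)
```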