# The CSV Module
In this notebook, you'll learn:
1. How to read files
2. How to use the CSV reader method

There's no better way to learn this than by actually exploring a .csv file. I chose to explore the [2023 Data Scientists' Salaries](https://www.kaggle.com/datasets/henryshan/2023-data-scientists-salary) and conducted data analysis towards the end of the notebook. The dataset includes the following variables:

```work_year```: The year the salary was paid.


```experience_level```: The experience level in the job during the year.

```EN``` > Entry-level / Junior

```MI``` > Mid-level / Intermediate

```SE``` > Senior-level / Expert

```EX``` > Executive-level / Director


```employment_type```: The type of employment for the role.

```PT``` > Part-time

```FT``` > Full-time

```CT``` > Contract

```FL``` > Freelance


```job_title```: The role worked in during the year.


```salary```: The total gross salary amount paid.


```salary_currency```: The currency of the salary paid as an ISO 4217 currency code.


```salary_in_usd```: The salary in USD.


```employee_residence```: Employee's primary country of residence during the work year as an ISO 3166 country code.


```remote_ratio```: The overall amount of work done remotely.


```company_location```: The country of the employer's main office or contracting branch.


```company_size```: The median number of people that worked for the company during the year.

## Opening a file
---

One can use Python to open a file, be it .jpeg, .txt or .pdf. Our focus is the .csv file, so we'll stick to that. But first, let's look at Python's built-in function that can be used to open, read and write files -- [```open()```](https://docs.python.org/3/library/functions.html#open).

The ```open()``` function can take in 8 arguments, but only 1 is mandatory, the file name.

```py
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
```

While the function can take in 8 arguments, we'll focus on 2:
* file - This is the path of the file. When the file is in the same directory as your notebook or script, you can simply specify the name of the file you want to open, written as a string, e.g. ```"ds_salaries.csv"```. This is a mandatory argument and must be specified.
* mode - This tells Python how it should open the file, based on your intentions. When not specified, it defaults to ```'r'```, which stands for "read (only)" mode. Other modes include:

| Character | Meaning |
| --- | --- |
| ```'r'``` | open for reading (default) |
| ```'w'``` | open for writing, truncating the file first |
| ```'x'``` | open for exclusive creation, failing if the file already exists |
| ```'a'``` | open for writing, appending to the end of file if it exists |
| ```'b'``` | binary mode |
| ```'t'``` | text mode (default) |
| ```'+'``` | open for updating (reading and writing)  |

As you can see, you can open a file in many other modes apart from ```r```. Be careful when opening the file in "write" mode, ```w```, as this deletes everything that was in the file to enable you to write in it. If you want to add to the file without deleting its contents, you should use the "append" mode, ```a```.
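
To make the difference between "write" and "append" concrete, here is a minimal sketch. It works on a throwaway file in a temporary directory (the name `demo.txt` is just a placeholder, not part of the dataset), so nothing important gets overwritten.

```py
import os
import tempfile

# Work on a throwaway file so no real data is touched.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

f = open(path, "w")          # "w" creates the file (or empties an existing one)
f.write("first line\n")
f.close()

f = open(path, "a")          # "a" keeps the contents and adds to the end
f.write("second line\n")
f.close()

f = open(path)               # the default mode is "r"
after_append = f.read()
f.close()

f = open(path, "w")          # "w" again: both earlier lines are wiped
f.write("only line\n")
f.close()

f = open(path)
after_write = f.read()
f.close()

print(repr(after_append))
print(repr(after_write))

# Clean up the throwaway file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

After the append, the file holds both lines. After reopening in ```w``` mode, only the newly written line is left.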

Once you are done with the file, it is imperative that you close it to avoid wasting the system's resources. Yes, even if you are using your big company's servers with the latest GPUs (I'm talking to you MAANG employees!).

Ok, now let's see all this in code.

In [40]:
f = open("ds_salaries.csv")
file_data = f.read()
f.close()

print(file_data)

work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M
2023,SE,FT,Applied Scientist,222200,USD,222200,US,0,US,L
2023,SE,FT,Applied Scientist,136000,USD,136000,US,0,US,L
2023,SE,FT,Data Scientist,219000,USD,219000,CA,0,CA,M
2023,SE,FT,Data Scientist,141000,USD,141000,CA,0,CA,M
2023,SE,FT,Data Scientist,147100,USD,147100,US,0,US,M
2023,SE,FT,Data Scientist,90700,USD,90700,US,0,US,M
2023,SE,FT,Data Analyst,130000,USD,130000,US,100,US,M
2023,SE,FT,Data Analyst,100000,USD,100000,US,100,US,M
2023,EN,FT,Applied Scientist,213660,USD,213660,US,0,US,L
2023,EN,FT,Applied Scientist,130760,USD,130760,US,0,US,L
2023,SE,FT,Data Mode

Now what exactly did I do?

First thing I did was to open the .csv file ```"ds_salaries.csv"```. I did this by calling the built-in Python function ```open()```

```py
open("ds_salaries.csv")
```

I wanted to open it in "read" mode, so there was no need of specifying the mode argument as it automatically defaults to this mode. However, you could include it if you wanted to be super clear.

```py
open("ds_salaries.csv", "r")
```

The ```open()``` function returns an object -- a file object to be precise. If you don't know what an object is, think of it as a data type just like a string, integer, float, boolean or list. So I stored this file object (or file datatype) in a variable which I named ```f```.

```py
f = open("ds_salaries.csv")
```

Remember, when you assign a function call to a variable, whatever the function returns is what gets stored in the variable.

Objects have special functions called methods that work only on the object. For example:
* ```.split()``` is a string method (a special function that only works on string objects)
* ```.append()``` is a list method (a special function that only works on list objects)

Likewise, the file object has methods. Examples are:
* ```.read()``` returns the data stored in the file as a string
* ```.close()``` closes the file.

Now, after opening the ```"ds_salaries.csv"``` file, and storing the file object in the variable ```f```, I called the ```.read()``` method to obtain all its data as a string. I then stored this string in a variable ```file_data```.

```py
file_data = f.read()
```

Now, we can close the file to save up on resources.

```py
f.close()
```

I printed out the variable ```file_data``` just to ensure I have a string.

```py
print(file_data)
```

We can also check the data type of ```file_data``` with the ```type()``` function.

In [41]:
print(type(file_data))

<class 'str'>


Python provides another syntax for opening files, just in case you're like me and have the short attention span of a goldfish and you always forget to close the file. This special syntax auto-closes the file for you once you are done with it by using the ```with``` keyword and code indentation. Let's see how it works in code.

In [42]:
with open("ds_salaries.csv") as f:
    data = f.read()

print(data)

work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M
2023,SE,FT,Applied Scientist,222200,USD,222200,US,0,US,L
2023,SE,FT,Applied Scientist,136000,USD,136000,US,0,US,L
2023,SE,FT,Data Scientist,219000,USD,219000,CA,0,CA,M
2023,SE,FT,Data Scientist,141000,USD,141000,CA,0,CA,M
2023,SE,FT,Data Scientist,147100,USD,147100,US,0,US,M
2023,SE,FT,Data Scientist,90700,USD,90700,US,0,US,M
2023,SE,FT,Data Analyst,130000,USD,130000,US,100,US,M
2023,SE,FT,Data Analyst,100000,USD,100000,US,100,US,M
2023,EN,FT,Applied Scientist,213660,USD,213660,US,0,US,L
2023,EN,FT,Applied Scientist,130760,USD,130760,US,0,US,L
2023,SE,FT,Data Mode

Bingo! Same results! All this does is call the ```open()``` function and assign the file object to a variable ```f```. We can now reference the file object ```f``` while within the ```with``` code block. Once outside the code block, the file is automatically closed. Pretty neat, huh?
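
If you'd like to see the auto-closing for yourself, file objects have a ```closed``` attribute. The sketch below writes a tiny throwaway file first (the name `demo.csv` is a placeholder) so it runs without the dataset.

```py
import os
import tempfile

# Create a small throwaway file so the example does not depend on ds_salaries.csv.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.csv")  # hypothetical file name

with open(path, "w") as f:
    f.write("work_year,job_title\n2023,Data Scientist\n")

with open(path) as f:
    inside = f.closed      # the file is still open inside the block
    text = f.read()

outside = f.closed         # the file is closed as soon as the block ends

print(inside, outside)

# Clean up the throwaway file and directory.
os.remove(path)
os.rmdir(tmp_dir)
```

This prints `False True`: open inside the ```with``` block, closed right after it.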

Ok, that's cool. But what now? We have the data as a single long string which is not really exciting in the field of data analysis. Well, we can turn it into a list...

In [43]:
salaries = data.split(',')
print(salaries)

['work_year', 'experience_level', 'employment_type', 'job_title', 'salary', 'salary_currency', 'salary_in_usd', 'employee_residence', 'remote_ratio', 'company_location', 'company_size\n2023', 'SE', 'FT', 'Principal Data Scientist', '80000', 'EUR', '85847', 'ES', '100', 'ES', 'L\n2023', 'MI', 'CT', 'ML Engineer', '30000', 'USD', '30000', 'US', '100', 'US', 'S\n2023', 'MI', 'CT', 'ML Engineer', '25500', 'USD', '25500', 'US', '100', 'US', 'S\n2023', 'SE', 'FT', 'Data Scientist', '175000', 'USD', '175000', 'CA', '100', 'CA', 'M\n2023', 'SE', 'FT', 'Data Scientist', '120000', 'USD', '120000', 'CA', '100', 'CA', 'M\n2023', 'SE', 'FT', 'Applied Scientist', '222200', 'USD', '222200', 'US', '0', 'US', 'L\n2023', 'SE', 'FT', 'Applied Scientist', '136000', 'USD', '136000', 'US', '0', 'US', 'L\n2023', 'SE', 'FT', 'Data Scientist', '219000', 'USD', '219000', 'CA', '0', 'CA', 'M\n2023', 'SE', 'FT', 'Data Scientist', '141000', 'USD', '141000', 'CA', '0', 'CA', 'M\n2023', 'SE', 'FT', 'Data Scientist',

Well, now we have even more problems.
* The data is in a single long list and that is still hard to draw insights from.
* The newline characters ```\n``` corrupt our data

What if we want to get our data as a list of lists? - Each row is a list, in one big list.
The newline characters, ```\n```, that separate rows, should tell our program to create a new list.

Well, we can do this manually, but this is tedious. Someone has already done all the heavy lifting for us by creating the ```csv``` module.
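
For the curious, the manual route could look roughly like the sketch below. The string `data` is a shortened stand-in for the file contents, and the `tricky` row is a made-up example (it is not from the dataset) that shows where plain splitting falls apart.

```py
# A tiny stand-in for the contents of ds_salaries.csv (a few columns only).
data = (
    "work_year,experience_level,employment_type,job_title\n"
    "2023,SE,FT,Principal Data Scientist\n"
    "2023,MI,CT,ML Engineer\n"
)

rows = []
for line in data.splitlines():    # one string per row, newline removed
    rows.append(line.split(","))  # one list of fields per row

print(rows[1])

# Where the manual approach breaks: a quoted field that contains a comma.
tricky = '2023,"Scientist, Data",FT'  # hypothetical row
print(tricky.split(","))
```

The first print gives a clean row, but the second splits the quoted job title into two pieces. Handling quoting rules like this is exactly the tedious part.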

## CSV Reader
---

How can available CSV methods help us solve the problem we initially created? It's great that we can read CSV files using Python, but how can we get them from a string into a powerful list of lists that we can analyze? First, we need to import the csv module.

In [44]:
import csv

Then we open our file as usual, but this time, we read it using csv's ```reader()``` function as opposed to the file object's built-in ```.read()``` method. This helps us achieve our goal of getting a list of lists, since the reader goes through the file row by row and turns each row into a list. Observe:

In [45]:
with open("ds_salaries.csv") as f:
    csv_object = csv.reader(f)
    list_of_lists = list(csv_object)
    
print(list_of_lists)

[['work_year', 'experience_level', 'employment_type', 'job_title', 'salary', 'salary_currency', 'salary_in_usd', 'employee_residence', 'remote_ratio', 'company_location', 'company_size'], ['2023', 'SE', 'FT', 'Principal Data Scientist', '80000', 'EUR', '85847', 'ES', '100', 'ES', 'L'], ['2023', 'MI', 'CT', 'ML Engineer', '30000', 'USD', '30000', 'US', '100', 'US', 'S'], ['2023', 'MI', 'CT', 'ML Engineer', '25500', 'USD', '25500', 'US', '100', 'US', 'S'], ['2023', 'SE', 'FT', 'Data Scientist', '175000', 'USD', '175000', 'CA', '100', 'CA', 'M'], ['2023', 'SE', 'FT', 'Data Scientist', '120000', 'USD', '120000', 'CA', '100', 'CA', 'M'], ['2023', 'SE', 'FT', 'Applied Scientist', '222200', 'USD', '222200', 'US', '0', 'US', 'L'], ['2023', 'SE', 'FT', 'Applied Scientist', '136000', 'USD', '136000', 'US', '0', 'US', 'L'], ['2023', 'SE', 'FT', 'Data Scientist', '219000', 'USD', '219000', 'CA', '0', 'CA', 'M'], ['2023', 'SE', 'FT', 'Data Scientist', '141000', 'USD', '141000', 'CA', '0', 'CA', 'M'

The ```reader()``` function reads the file, but instead of returning a string, it returns a reader object that yields each row as a list of strings. To turn this into a list of lists, we just need to call the ```list()``` function on this reader object. Now we can draw meaningful insights from this data. That's it for now.
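
Since the reader object hands out one row at a time, you can also pull rows off it step by step. Here is a small sketch of that. It uses ```io.StringIO``` as a stand-in for an open file so that it runs on its own, and the row with the quoted comma is made up for illustration.

```py
import csv
import io

# io.StringIO stands in for an open file so the example runs on its own.
sample = io.StringIO(
    'work_year,job_title,salary_in_usd\n'
    '2023,Principal Data Scientist,85847\n'
    '2023,"Scientist, Data",30000\n'   # hypothetical row with a quoted comma
)

reader = csv.reader(sample)
header = next(reader)       # the first row the reader yields
rows = list(reader)         # the remaining rows

print(header)
print(rows)
print(list(reader))         # the reader is used up after one pass
```

Two things to notice: the quoted comma stays inside a single field, and a reader can only be walked through once, which is why we convert it to a list before leaving the ```with``` block. When opening a real file for the ```csv``` module, the official documentation also recommends passing ```newline=''``` to ```open()```.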

# Data Analysis in Base Python

---

Alright, now that we know how to work with CSV files, let's actually draw some insights from this data without the use of any third party libraries. I'm curious to know:
1. Salaries:
    - What's the average salary of a data scientist?
        - What's their average salary in each year?
    - In which year were data scientists paid the most, and in which the least?
        - Which year had the single highest salary, and which the single lowest?
    - How do the salaries differ? Measure their spread:
        - Overall
        - In each year
        - In each experience level
        - In each experience level in each year
2. Experience:
    - Which experience level has the highest & lowest average salary?
        - Which experience level had the highest & lowest salary in each year?
3. Employment Type:
    - Which employment type has the highest and lowest average salaries?
        - Which employment type has the highest and lowest average salaries in each year?
4. Job Titles (Roles):
    - How many job titles does the realm of data science have? And which one has the highest average salary?
        - Which role has the highest & lowest paying salary?
        - Which role has the highest & lowest average paying salary in each year?
5. Currency:
    - Which currency paid data scientists the most and least on average?
        - Which currency paid data scientists the most and least on average in each year?
6. Employee Residence:
    - Where do the highest and lowest paid data scientists reside, on average?
        - How does this vary from year to year (if at all)?
7. Remote Ratio:
    - How many data scientists work fully remote, fully in office and hybrid? How do their average salaries compare?
        - How do their salaries compare in each year?
8. Company Location:
    - Where are the companies that pay the most & least on average located?
        - How does this vary from year to year?
9. Company Size:
    - Which company size pays the most and least on average?
        - How do the average salaries of each company size compare in each year?

In [46]:
# Import Statements:
from ProbStatipy import central, spread

Wait a minute! Hold up!

![Hold up meme: A drawing of a cartoon stretching its hands signaling the viewer to stop with the caption "Wait a minute, Hold Up"](https://th.bing.com/th/id/OIP.UjDCCX5rd_2CFE7gFXxyDwHaFX?w=285&h=206&c=7&r=0&o=5&dpr=1.5&pid=1.7)

I thought we was doing analysis using raw Python! Why you importing weird modules for?

Let me explain. [`ProbStatipy`](https://github.com/0gregory0/ProbStatipy) is my package. I made it in base Python and published it on [PyPI](https://pypi.org/project/ProbStatipy/). It relies on `math`, a module in the Python Standard Library. It contains functions to measure central tendencies such as mean, median and mode, and dispersion such as variance, standard deviation, mean absolute deviation, range and interquartile range.

I figured that there's no need to rewrite these functions when I can actually reuse my own. Now that that's out of the way, let's continue.

We can start by familiarizing ourselves with the data.

In [47]:
# Reopening the csv file
with open("ds_salaries.csv") as f:
    csv_object = csv.reader(f)
    data = list(csv_object)

# Printing out the first 5 rows
print(data[:5])

[['work_year', 'experience_level', 'employment_type', 'job_title', 'salary', 'salary_currency', 'salary_in_usd', 'employee_residence', 'remote_ratio', 'company_location', 'company_size'], ['2023', 'SE', 'FT', 'Principal Data Scientist', '80000', 'EUR', '85847', 'ES', '100', 'ES', 'L'], ['2023', 'MI', 'CT', 'ML Engineer', '30000', 'USD', '30000', 'US', '100', 'US', 'S'], ['2023', 'MI', 'CT', 'ML Engineer', '25500', 'USD', '25500', 'US', '100', 'US', 'S'], ['2023', 'SE', 'FT', 'Data Scientist', '175000', 'USD', '175000', 'CA', '100', 'CA', 'M']]


## Preparing / Cleaning the Data
Before we start analyzing the data, we must first clean it and make it ready for the analysis ahead.

In [48]:
# separating the title row from the rows containing the actual data
title_row = data[0]
data_rows = data[1:]

# Changing the numerical columns into numerical values:

## 'work_year' should be ints
work_year_index = title_row.index('work_year') # getting the index of the 'work_year' column

for row in data_rows:
    row[work_year_index] = int(row[work_year_index])

## 'salary' should be float
salary_index = title_row.index('salary') # getting the index of the 'salary' column

for row in data_rows:
    row[salary_index] = float(row[salary_index])

## 'salary_in_usd' should be float
usd_salary_index = title_row.index("salary_in_usd") # getting the index of the 'salary_in_usd' column

for row in data_rows:
    row[usd_salary_index] = float(row[usd_salary_index])

## 'remote_ratio' should be int
remote_ratio_index = title_row.index('remote_ratio') # getting the index of the 'remote_ratio' column

for row in data_rows:
    row[remote_ratio_index] = int(row[remote_ratio_index])
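
The four conversions above all follow the same pattern, so they could also be folded into one small helper. This is only a sketch: `convert_columns` is a name I made up for illustration, and the two sample rows are trimmed-down versions of rows from the dataset.

```py
def convert_columns(title_row, data_rows, converters):
    """Convert the named columns of every row in place.

    `converters` maps a column name to the function used to convert it.
    """
    for name, convert in converters.items():
        index = title_row.index(name)  # position of the column in each row
        for row in data_rows:
            row[index] = convert(row[index])


# Trimmed-down sample data (only four of the eleven columns).
title_row = ['work_year', 'job_title', 'salary_in_usd', 'remote_ratio']
data_rows = [
    ['2023', 'Principal Data Scientist', '85847', '100'],
    ['2023', 'ML Engineer', '30000', '100'],
]

convert_columns(
    title_row,
    data_rows,
    {'work_year': int, 'salary_in_usd': float, 'remote_ratio': int},
)

print(data_rows[0])
```

Looking the index up by column name, rather than hard-coding it, keeps the code working even if the column order in the file changes.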

## Analyzing the Data.
Finally, let's draw some insights!

### 1. Salaries
#### The average salary of a data scientist.
For this, and the rest of the questions, we'll analyse the salary in USD (`salary_in_usd`) column because having a uniform currency will help us draw reasonable insights.

In [49]:
# creating a list of salaries to get their means
salaries = []

for row in data_rows:
    salaries.append(row[usd_salary_index]) # already floats after the cleaning step

# getting the average salary of a data scientist
avg_salary = central.mean(salaries)
avg_salary_rounded = round(avg_salary)

print(avg_salary)
print(avg_salary_rounded)


137570.38988015978
137570


#### The average salary for each year
To get this, we must first get all the years that are in our data, then compute the mean for each year.

In [50]:
#### Getting all the years in our data ####
#### --------------------------------- ####

# "work_year" is the first column of the data, so its index is 0

# Getting the unique list of years
years = []
for row in data_rows:
    if row[0] not in years:
        years.append(row[0])

# Sorting the unique list of years
years.sort()

print(years)

[2020, 2021, 2022, 2023]
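
A shorter route to the same kind of result is to let a set drop the duplicates and then sort it. A minimal sketch, using a few made-up two-column rows in place of `data_rows`:

```py
# Hypothetical stand-in for data_rows: the year is in column 0.
data_rows = [[2023, 'SE'], [2020, 'MI'], [2023, 'MI'], [2021, 'EN'], [2020, 'SE']]

# A set keeps each year once; sorted() turns it back into an ordered list.
years = sorted({row[0] for row in data_rows})

print(years)
```

For this sample it prints `[2020, 2021, 2023]`.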


In [51]:
#### Computing the average salaries for each year ####
#### -------------------------------------------- ####

# for the year 2020:
salaries_2020 = []
for row in data_rows:
    if row[0] == 2020: # compare the year column only, not every value in the row
        salaries_2020.append(row[usd_salary_index])

avg_sal_2020 = central.mean(salaries_2020)
avg_sal_2020_rounded = round(avg_sal_2020)

# for the year 2021:
salaries_2021 = []
for row in data_rows:
    if row[0] == 2021: # compare the year column only, not every value in the row
        salaries_2021.append(row[usd_salary_index])

avg_sal_2021 = central.mean(salaries_2021)
avg_sal_2021_rounded = round(avg_sal_2021)

# for the year 2022:
salaries_2022 = []
for row in data_rows:
    if row[0] == 2022: # compare the year column only, not every value in the row
        salaries_2022.append(row[usd_salary_index])

avg_sal_2022 = central.mean(salaries_2022)
avg_sal_2022_rounded = round(avg_sal_2022)

# for the year 2023:
salaries_2023 = []
for row in data_rows:
    if row[0] == 2023: # compare the year column only, not every value in the row
        salaries_2023.append(row[usd_salary_index])

avg_sal_2023 = central.mean(salaries_2023)
avg_sal_2023_rounded = round(avg_sal_2023)

print(avg_sal_2020)
print(avg_sal_2020_rounded)
print('')

print(avg_sal_2021)
print(avg_sal_2021_rounded)
print('')

print(avg_sal_2022)
print(avg_sal_2022_rounded)
print('')

print(avg_sal_2023)
print(avg_sal_2023_rounded)

92302.63157894737
92303

94087.20869565217
94087

133338.62079326922
133339

149045.54117647058
149046
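
The per-year blocks above repeat the same steps, so the grouping can also be done in one pass with a dictionary. The sketch below is an alternative, not what the notebook ran: it uses `statistics.mean` from the standard library instead of `ProbStatipy`, and the rows are made-up numbers chosen to keep the arithmetic easy to check.

```py
from statistics import mean

# Hypothetical stand-in for data_rows: [work_year, salary_in_usd].
usd_salary_index = 1
data_rows = [
    [2022, 100000.0],
    [2023, 120000.0],
    [2022, 140000.0],
    [2023, 160000.0],
    [2023, 200000.0],
]

# Collect the salaries of each year in a dictionary: {year: [salaries]}.
salaries_by_year = {}
for row in data_rows:
    salaries_by_year.setdefault(row[0], []).append(row[usd_salary_index])

# Compute one average per year, in chronological order.
avg_by_year = {year: mean(values) for year, values in sorted(salaries_by_year.items())}

print(avg_by_year)
```

With this approach, a new year showing up in the data needs no new code.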


#### Highest & Lowest Salary Recorded
Getting this will help us answer questions such as:
> "Which year had the single highest salary and vice versa?"

We get the highest and lowest salaries by using our list of salaries. We can then pin these salaries to their respective rows and draw insights.

In [52]:
# Determining the highest and lowest salary ever recorded
highest_salary = max(salaries)
lowest_salary = min(salaries)

# Getting the row(s) where these salaries were recorded
highest_sal_rows = []
lowest_sal_rows = []
for row in data_rows:
    if highest_salary == row[usd_salary_index]:
        highest_sal_rows.append(row)
    elif lowest_salary == row[usd_salary_index]:
        lowest_sal_rows.append(row)
    else:
        pass

print(highest_salary)
print(highest_sal_rows)
print(f'{highest_sal_rows[0][0]} is the year with the highest salary ever recorded')
print('')

print(lowest_salary)
print(lowest_sal_rows)
print(f'{lowest_sal_rows[0][0]} is the year with the lowest salary ever recorded')

450000.0
[[2020, 'MI', 'FT', 'Research Scientist', 450000.0, 'USD', 450000.0, 'US', 0, 'US', 'M']]
2020 is the year with the highest salary ever recorded

5132.0
[[2022, 'MI', 'FT', 'NLP Engineer', 120000.0, 'CZK', 5132.0, 'CZ', 100, 'CZ', 'M']]
2022 is the year with the lowest salary ever recorded
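
When only one row is needed, ```max()``` and ```min()``` can find it directly through their ```key``` argument. A small sketch, with three rows from the dataset cut down to three columns (the helper name `usd_salary` is mine):

```py
# Three rows from the dataset, trimmed to [work_year, job_title, salary_in_usd].
usd_salary_index = 2
data_rows = [
    [2020, 'Research Scientist', 450000.0],
    [2022, 'NLP Engineer', 5132.0],
    [2023, 'Data Scientist', 175000.0],
]


def usd_salary(row):
    """Return the value that rows should be compared by."""
    return row[usd_salary_index]


highest_row = max(data_rows, key=usd_salary)
lowest_row = min(data_rows, key=usd_salary)

print(highest_row)
print(lowest_row)
```

One caveat: if several rows share the top (or bottom) salary, ```max()``` and ```min()``` return only the first of them, whereas the loop above collects every matching row.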


#### Dispersion or spread of salaries
How are the salaries spread out:
- Overall
- In each year
- In each experience level
- In each experience level in each year

##### Overall

In [53]:
sal_stdev = spread.stdeviation(salaries)
sal_range = spread.get_range(salaries)
sal_iqr = spread.iqr(salaries)

print(f'''The salaries in this dataset have a:
      standard deviation of {sal_stdev} approx. {round(sal_stdev)}
      range of {sal_range}
      and an interquartile range of {sal_iqr}''')

The salaries in this dataset have a:
      standard deviation of 63047.22849740541 approx. 63047
      range of 444868.0
      and an interquartile range of 80000.0


###### Interpretation
**Range** is the difference between the highest and the lowest value in our data set. In this case, the difference between the highest and the lowest earner is `$444868.0`!

**Interquartile range** is the difference between the third quartile (Q3) and the first quartile (Q1). It reduces the influence of outliers by sorting the data in ascending order, dividing it into 4 equal groups, and disregarding the first and last groups when determining the range. Without the lowest 25% and the highest 25% of the data, the difference between the highest and the lowest remaining data point is `$80000.0`.

            |----------------+----------------+----------------+---------------|
        starting    group   Q1      group     Q2    group      Q3   group   ending
        point         1               2    (median)   3               4     point
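
To see range and interquartile range side by side on a handful of numbers, here is a sketch that uses `statistics.quantiles` from the standard library on the first eight USD salaries in the file. Keep in mind that there are several conventions for computing quartiles, so `ProbStatipy` may not give exactly the same quartile values on small samples.

```py
from statistics import quantiles

# salary_in_usd of the first eight data rows in ds_salaries.csv.
sample = [85847, 30000, 25500, 175000, 120000, 222200, 136000, 219000]

# Range: highest value minus lowest value.
sal_range = max(sample) - min(sample)

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
q1, q2, q3 = quantiles(sample, n=4)
iqr = q3 - q1

print(f'range: {sal_range}')
print(f'Q1: {q1}, Q2: {q2}, Q3: {q3}')
print(f'interquartile range: {iqr}')
```

The range looks only at the two most extreme salaries, while the interquartile range looks only at the middle half of the data.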

**Standard deviation** measures how far the data points typically lie from the mean. In our dataset, the average salary was `$137570` and the standard deviation is `$63047`, so a typical salary sits roughly `$63047` away from the mean. A standard deviation this high, about 46% of the mean, shows the uncertainty as to how much a data scientist's work is really valued at.
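
Here is a sketch of how a standard deviation can be computed by hand, on a small made-up list so the arithmetic is easy to follow. It uses the population formula (divide by the number of values) and checks the result against `statistics.pstdev`. Whether `ProbStatipy` uses the population or the sample formula (divide by one less) is something I'm leaving as an assumption here.

```py
import math
from statistics import pstdev

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up numbers with a mean of exactly 5

mean = sum(values) / len(values)

# 1. Square each value's distance from the mean.
squared_deviations = [(value - mean) ** 2 for value in values]

# 2. Average the squared distances: this is the (population) variance.
variance = sum(squared_deviations) / len(values)

# 3. Take the square root to get back to the original units.
stdev = math.sqrt(variance)

print(mean, variance, stdev)
print(pstdev(values))  # the standard library's population standard deviation
```

Both prints end in `2.0`: the squared distances add up to 32, their average is 4, and the square root of 4 is 2.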

The standard deviation also comes in handy in determining usual and unusual values.
**Usual values** are those that lie within two standard deviations of the mean and **unusual values** are those that lie outside that range.

    -----------------------+------------------------+------------------------+-----------------------
        <-  unusual      mean -       usual        mean       usual        mean +          unusual ->
            values      2(stdev)      values                  values      2(stdev)         values

Unusual values are the values that are least likely to occur. For data that is roughly normally distributed, they have a probability of about 5% (.05) or less. Let's find out our unusual values.

In [54]:
upper_usual_limit = avg_salary + (2 * sal_stdev)
lower_usual_limit = avg_salary - (2 * sal_stdev)

print(f'The highest usual salary is {upper_usual_limit} and the lowest usual salary is {lower_usual_limit}')
print('')

low_unusual_sal = []
high_unusual_sal = []

for salary in salaries:
    if salary < lower_usual_limit:
        low_unusual_sal.append(salary)
    elif salary > upper_usual_limit:
        high_unusual_sal.append(salary)

# Sorting once, after all the unusual salaries have been collected
low_unusual_sal.sort()
high_unusual_sal.sort()

unusual_sal = low_unusual_sal + high_unusual_sal

print(f'The unusual salaries are {unusual_sal}')
print('')

print(f'The unusually low salaries are {low_unusual_sal}')
print('')

print(f'The unusually high salaries are {high_unusual_sal}')

The highest usual salary is 263664.8468749706 and the lowest usual salary is 11475.932885348957

The unusual salaries are [5132.0, 5409.0, 5409.0, 5679.0, 5707.0, 5723.0, 5882.0, 6072.0, 6072.0, 6270.0, 6304.0, 6359.0, 7000.0, 7500.0, 7799.0, 8000.0, 8000.0, 8050.0, 9272.0, 9289.0, 9466.0, 9727.0, 10000.0, 10000.0, 10000.0, 10000.0, 10000.0, 10354.0, 265000.0, 265000.0, 265000.0, 266400.0, 269000.0, 269600.0, 270000.0, 270703.0, 272000.0, 272550.0, 272550.0, 275000.0, 275000.0, 275000.0, 275000.0, 275000.0, 275000.0, 275300.0, 275300.0, 275300.0, 275300.0, 275300.0, 275300.0, 275300.0, 275300.0, 276000.0, 276000.0, 276000.0, 280000.0, 280100.0, 280700.0, 283200.0, 284000.0, 284310.0, 285800.0, 286000.0, 288000.0, 288000.0, 289076.0, 289076.0, 289800.0, 289800.0, 290000.0, 291500.0, 291500.0, 293000.0, 297300.0, 297300.0, 297300.0, 297300.0, 297500.0, 299500.0, 299500.0, 299500.0, 299500.0, 299500.0, 300000.0, 300000.0, 300000.0, 300000.0, 300000.0, 300000.0, 300000.0, 300000.0, 300000.0

In [57]:
print(len(salaries))
print(len(unusual_sal))
print(0.05 * len(salaries))

3755
137
187.75


**Interpretation**

As mentioned earlier, it's normal for 5% or less of your data to consist of unusual values. We have 3755 rows in our data set, therefore, we can expect at most 188 rows to contain unusual values. In line with this, our dataset contains only 137 entries of unusual salaries. Unusual values are not bad, they are just less likely to occur.

The highest usual salary is approx. `263664` and the lowest usual salary is approx. `11475`. Therefore, it's unlikely, but not impossible, to get a data scientist earning below `$11,475` or above `$263,664`.
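
Since the same check will come in handy for each year and each experience level, the two-standard-deviation rule can be wrapped in a small function. This is a sketch only: `find_unusual` is a name I made up, it uses the standard library's `statistics` module rather than `ProbStatipy`, and the sample values are invented.

```py
from statistics import mean, pstdev


def find_unusual(values, n_stdevs=2):
    """Return the sorted values lying more than n_stdevs standard deviations from the mean."""
    center = mean(values)
    margin = n_stdevs * pstdev(values)  # population standard deviation (an assumption)
    lower_limit = center - margin
    upper_limit = center + margin
    return sorted(value for value in values if value < lower_limit or value > upper_limit)


# Made-up values: nine similar numbers and one that clearly stands out.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 11, 100]

unusual = find_unusual(sample)
share = len(unusual) / len(sample)

print(unusual)
print(share)
```

Here only `100` falls outside the limits. Notice that it makes up 10% of this tiny, lopsided sample: the 5% guideline is tied to bell-shaped data and will not hold for every dataset.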