### Exercise 1: Organize Dataset Directory

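I created a new directory, `original_zips`, and moved every `.zip` file in `~/Documents/DataProblems` into it using the commands below. `mkdir -p` is safe to re-run because it does not fail when the directory already exists. If no ZIP files are present, the `*.zip` pattern matches nothing, so the shell or `mv` reports an error and no files are moved.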

In [None]:
!mkdir -p ~/Documents/DataProblems/original_zips
!mv ~/Documents/DataProblems/*.zip ~/Documents/DataProblems/original_zips/
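
The no-match case can also be handled quietly. The following is a minimal sketch, assuming a bash or POSIX `sh` shell (zsh treats unmatched patterns differently by default); `move_zips` is a hypothetical helper name and the demo files are made up.

```shell
# Hypothetical helper: move every .zip in DIR into DIR/original_zips.
move_zips() {
  dir=$1
  mkdir -p "$dir/original_zips"
  for f in "$dir"/*.zip; do
    # With no matches, the pattern is left as a literal string; skip it.
    [ -e "$f" ] || continue
    mv "$f" "$dir/original_zips/"
  done
}

# Demo on a throwaway directory with made-up file names.
demo=$(mktemp -d)
touch "$demo/a.zip" "$demo/b.zip" "$demo/notes.txt"
move_zips "$demo"
ls "$demo/original_zips"
```

Calling `move_zips` on a directory without ZIP files creates the subfolder and exits without printing an error.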

### Exercise 2: Split `diabetes_prediction_dataset.csv` into 3 parts

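I used `head`, `tail`, and redirection operators to split the file into 3 parts, then `wc -l` to check the line counts. Each part starts with the header row and holds roughly 1/3 of the 100,000 data rows. The `Broken pipe` messages in the output are harmless: `head` exits as soon as it has read its 33,333 lines, while `tail` is still writing.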

In [6]:
!head -n 1 ~/Documents/DataProblems/diabetes_prediction_dataset.csv > part1.csv
!head -n 1 ~/Documents/DataProblems/diabetes_prediction_dataset.csv > part2.csv
!head -n 1 ~/Documents/DataProblems/diabetes_prediction_dataset.csv > part3.csv

# First 33333 rows
!tail -n +2 ~/Documents/DataProblems/diabetes_prediction_dataset.csv | head -n 33333 >> part1.csv

# Next 33333 rows (skip the header and the first 33333 rows)
!tail -n +33335 ~/Documents/DataProblems/diabetes_prediction_dataset.csv | head -n 33333 >> part2.csv

# Remaining rows (skip the header and the first 66666 rows)
!tail -n +66668 ~/Documents/DataProblems/diabetes_prediction_dataset.csv >> part3.csv

tail: stdout: Broken pipe
tail: stdout: Broken pipe


In [7]:
!wc -l part1.csv
!wc -l part2.csv
!wc -l part3.csv

   33334 part1.csv
   33334 part2.csv
   33335 part3.csv
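
The hard-coded offsets can be derived from the row count instead. This is a sketch only: `split_with_header` is a hypothetical helper, the sample data is made up, and it rounds the part size up, so the parts come out slightly different from the ones above.

```shell
# Hypothetical helper: split FILE into PARTS pieces named PREFIX1.csv, PREFIX2.csv, ...
split_with_header() {
  file=$1; parts=$2; prefix=$3
  total=$(( $(wc -l < "$file") - 1 ))          # data rows, header excluded
  size=$(( (total + parts - 1) / parts ))      # rows per part, rounded up
  i=1
  while [ "$i" -le "$parts" ]; do
    start=$(( (i - 1) * size + 2 ))            # line number of this part's first row
    head -n 1 "$file" > "${prefix}${i}.csv"    # header first
    tail -n +"$start" "$file" | head -n "$size" >> "${prefix}${i}.csv"
    i=$(( i + 1 ))
  done
}

# Demo on made-up data: a header plus 10 rows, split three ways.
demo=$(mktemp -d)
{ echo "id,value"; seq 1 10 | sed 's/$/,x/'; } > "$demo/sample.csv"
split_with_header "$demo/sample.csv" 3 "$demo/part"
wc -l "$demo"/part*.csv
```

For 100,000 data rows and 3 parts this gives 33,334, 33,334, and 33,332 rows per part.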


### Exercise 3: Split `Heart_Disease_Prediction.csv` by Label

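I created two new files, `presence.csv` and `absence.csv`, holding the rows that contain `"Presence"` and `"Absence"` respectively. Each file starts with the header row, so the line counts below correspond to 120 and 150 data rows. `grep` matches the word anywhere in a line, which assumes the label values appear only in the label column.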

In [9]:
!head -n 1 ~/Documents/DataProblems/Heart_Disease_Prediction.csv > presence.csv
!head -n 1 ~/Documents/DataProblems/Heart_Disease_Prediction.csv > absence.csv

# Append matching rows
!grep "Presence" ~/Documents/DataProblems/Heart_Disease_Prediction.csv >> presence.csv
!grep "Absence" ~/Documents/DataProblems/Heart_Disease_Prediction.csv >> absence.csv

In [10]:
!wc -l presence.csv
!wc -l absence.csv

     121 presence.csv
     151 absence.csv
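
If the label text could also appear in another column, the match can be restricted to a single field. The sketch below assumes the label is the last column; `rows_with_label` is a hypothetical helper and the sample rows are made up to show a case where a plain `grep` would over-match.

```shell
# Hypothetical helper: print the header plus rows whose LAST field equals LABEL.
rows_with_label() {
  awk -F',' -v label="$1" '
    { sub(/\r$/, "") }          # tolerate Windows line endings
    NR == 1 || $NF == label     # header, or exact match on the last field
  ' "$2"
}

# Made-up sample: the first data row mentions "Presence" in a non-label column.
demo=$(mktemp -d)
cat > "$demo/sample.csv" <<'EOF'
Age,Notes,Heart Disease
70,Presence of murmur,Absence
67,none,Presence
57,none,Absence
EOF
rows_with_label Presence "$demo/sample.csv" > "$demo/presence.csv"
rows_with_label Absence  "$demo/sample.csv" > "$demo/absence.csv"
cat "$demo/presence.csv"
```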


### Exercise 4: Fraction of Cars with No Accidents

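I used `grep -i` and `wc -l` to count the rows that mention `"no accident"`, and `tail -n +2` with `wc -l` to count all rows excluding the header. Dividing the two counts gives 2223 / 2840 ≈ 0.783, so about 78% of the cars are listed with no accidents.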

In [11]:
!grep -i "no accident" ~/Documents/DataProblems/car_web_scraped_dataset.csv | wc -l

# Count total number of records (excluding header)
!tail -n +2 ~/Documents/DataProblems/car_web_scraped_dataset.csv | wc -l

    2223
    2840
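
The division itself can be done in the shell as well. This is a sketch rather than the cell that was run: `match_fraction` is a hypothetical helper, it counts matches among data rows only, and the demo rows are made up.

```shell
# Hypothetical helper: fraction of data rows matching PATTERN (case-insensitive).
match_fraction() {
  pattern=$1; file=$2
  matches=$(tail -n +2 "$file" | grep -ci "$pattern")   # matching data rows
  total=$(tail -n +2 "$file" | wc -l)                    # all data rows
  awk -v m="$matches" -v t="$total" 'BEGIN { printf "%.4f\n", m / t }'
}

# Made-up sample: 3 of 4 rows mention "No accidents".
demo=$(mktemp -d)
cat > "$demo/cars.csv" <<'EOF'
name,condition
Car A,"No accidents reported, 1 Owner"
Car B,"1 accident reported, 2 Owners"
Car C,"No accidents reported, 3 Owners"
Car D,"No accidents reported, 1 Owner"
EOF
match_fraction "no accident" "$demo/cars.csv"

# The two counts reported above, divided directly:
awk 'BEGIN { printf "%.4f\n", 2223 / 2840 }'   # prints 0.7827
```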


### Exercise 5: Value Replacement in `Housing.csv`

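I chained `sed` substitutions with pipes to encode the string values as numbers (`yes` → 1, `no` → 0, `furnished` → 1, `semi-furnished` → 2, `unfurnished` → 0) and saved the result to a new file, `Housing_Cleaned.csv`. The last pattern is anchored to the preceding comma so that it cannot match inside `semi-furnished` or `unfurnished`. The `yes` and `no` patterns match anywhere in a line, so this relies on no other field or column name containing those substrings.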

In [12]:
!cat ~/Documents/DataProblems/Housing.csv \
| sed 's/yes/1/g' \
| sed 's/no/0/g' \
| sed 's/semi-furnished/2/g' \
| sed 's/unfurnished/0/g' \
| sed 's/,furnished/,1/g' \
> Housing_Cleaned.csv

In [13]:
!head -n 10 Housing_Cleaned.csv

price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
13300000,7420,4,2,3,1,0,0,0,1,2,1,1
12250000,8960,4,4,4,1,0,0,0,1,3,0,1
12250000,9960,3,2,2,1,0,1,0,0,2,1,2
12215000,7500,4,2,2,1,0,1,0,1,3,1,1
11410000,7420,4,1,2,1,1,1,0,1,2,0,1
10850000,7500,3,3,1,1,0,1,0,1,2,1,2
10150000,8580,4,3,4,1,0,0,0,1,2,1,2
10150000,16200,5,3,2,1,0,0,0,0,0,0,0
9870000,8100,4,1,2,1,1,1,0,1,2,1,1
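
A field-by-field variant avoids depending on substrings not colliding. The following is a sketch under the same value mapping; `encode_housing` is a hypothetical helper and the sample rows are made up.

```shell
# Hypothetical helper: replace whole field values only, leaving the header untouched.
encode_housing() {
  awk -F',' -v OFS=',' '
    { sub(/\r$/, "") }                 # tolerate Windows line endings
    NR == 1 { print; next }            # header passes through unchanged
    {
      for (i = 1; i <= NF; i++) {
        if      ($i == "yes")            $i = 1
        else if ($i == "no")             $i = 0
        else if ($i == "furnished")      $i = 1
        else if ($i == "semi-furnished") $i = 2
        else if ($i == "unfurnished")    $i = 0
      }
      print
    }' "$1"
}

# Made-up sample rows.
demo=$(mktemp -d)
cat > "$demo/sample.csv" <<'EOF'
price,mainroad,guestroom,furnishingstatus
100,yes,no,furnished
200,no,no,semi-furnished
300,yes,yes,unfurnished
EOF
encode_housing "$demo/sample.csv"
```

Because each field is compared as a whole, a value such as `north` would be left alone, whereas `s/no/0/g` would turn it into `0rth`.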


### Exercise 6: Remove "CustomerID" from `Mall_Customers.csv`

I used the `cut` command with `-d ','` and `-f2-` to keep every field from the second onward, which removes the first column (CustomerID), and saved the result as `Mall_Customers_Cleaned.csv`.

In [14]:
!cut -d ',' -f2- ~/Documents/DataProblems/Mall_Customers.csv > Mall_Customers_Cleaned.csv

In [15]:
!head -n 5 Mall_Customers_Cleaned.csv

Gender,Age,Annual Income (k$),Spending Score (1-100)
Male,19,15,39
Male,21,15,81
Female,20,16,6
Female,23,16,77
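
`cut -f2-` depends on CustomerID being the first column. As a sketch of a position-independent alternative, the column can be looked up by name in the header; `drop_column` is a hypothetical helper, and like `cut` it assumes no field contains a quoted comma.

```shell
# Hypothetical helper: print FILE without the column whose header equals NAME.
drop_column() {
  awk -F',' -v name="$1" '
    BEGIN { drop = 0 }
    { sub(/\r$/, "") }                                       # tolerate Windows line endings
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) drop = i }
    {
      out = ""; sep = ""
      for (i = 1; i <= NF; i++)
        if (i != drop) { out = out sep $i; sep = "," }
      print out
    }' "$2"
}

# Made-up sample rows.
demo=$(mktemp -d)
cat > "$demo/sample.csv" <<'EOF'
CustomerID,Gender,Age
1,Male,19
2,Female,21
EOF
drop_column CustomerID "$demo/sample.csv"
```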


In [16]:
!head -n 1 ~/Documents/DataProblems/world\ all\ university\ rank\ and\ rank\ score.csv

rank,ranking-institution-title,location,Overall scores,Research Quality Score,Industry Score,International Outlook,Research Environment Score,Teaching Score


### Exercise 7: Sum Score Columns

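I used `cut`, `tr`, `grep`, and `bc` to extract 4 score columns (fields 5 to 8: Research Quality Score, Industry Score, International Outlook, and Research Environment Score) and calculate their row-wise sum. `tr -d '\r'` strips Windows line endings, `grep -E` keeps only rows in which all four fields are plain numbers, and `tr ',' '+'` turns each row into an expression for `bc`. The sums are saved, one per line and without a header, in `University_Score_Sum.csv`. Rows dropped by the filter leave no placeholder, so the output lines do not necessarily align one-to-one with the input rows.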

In [20]:
!tail -n +2 ~/Documents/DataProblems/world\ all\ university\ rank\ and\ rank\ score.csv \
| cut -d ',' -f5,6,7,8 \
| tr -d '\r' \
| grep -E '^[0-9]+(\.[0-9]+)?,[0-9]+(\.[0-9]+)?,[0-9]+(\.[0-9]+)?,[0-9]+(\.[0-9]+)?$' \
| tr ',' '+' \
| bc > University_Score_Sum.csv

In [21]:
!head University_Score_Sum.csv

378.2
367.2
340.5
361.2
353.3
363.0
335.9
354.9
329.1
355.7
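
The same row-wise sum can be sketched in a single `awk` program, which does the numeric check and the addition in one pass. `sum_scores` is a hypothetical helper and the sample rows are made up; unlike `bc`, it always prints one decimal place.

```shell
# Hypothetical helper: sum fields 5 to 8 of each data row, skipping non-numeric rows.
sum_scores() {
  awk -F',' '
    NR == 1 { next }                                   # skip the header
    {
      sub(/\r$/, "")                                   # tolerate Windows line endings
      sum = 0
      for (i = 5; i <= 8; i++) {
        if ($i !~ /^[0-9]+(\.[0-9]+)?$/) next          # drop rows with a non-numeric score
        sum += $i
      }
      printf "%.1f\n", sum
    }' "$1"
}

# Made-up sample: the second data row has a non-numeric score and is skipped.
demo=$(mktemp -d)
cat > "$demo/sample.csv" <<'EOF'
rank,title,location,overall,research,industry,international,environment,teaching
1,Uni A,UK,98.5,99.0,98.5,97.5,100.0,96.6
2,Uni B,US,98.0,99.5,100.0,n/a,96.0,99.0
3,Uni C,US,97.0,90.0,80.5,70.5,60.0,50.0
EOF
sum_scores "$demo/sample.csv"
```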


### Exercise 8: Sort `cancer patient data sets.csv` by Age

I sorted the data rows numerically by the "Age" column (column 3) using `sort` with `-t','` and `-k3,3n`, writing the header first so that it stays on the top line. Limiting the key to `3,3` matters: a bare `-k3` makes the sort key run from column 3 to the end of the line. The result is saved in `Cancer_Sorted.csv`.

In [23]:
!head -n 1 ~/Documents/DataProblems/cancer\ patient\ data\ sets.csv > Cancer_Sorted.csv

# Sort remaining rows by Age (3rd column only) and append to output
!tail -n +2 ~/Documents/DataProblems/cancer\ patient\ data\ sets.csv \
| sort -t',' -k3,3n >> Cancer_Sorted.csv

In [24]:
!head Cancer_Sorted.csv

index,Patient Id,Age,Gender,Air Pollution,Alcohol use,Dust Allergy,OccuPational Hazards,Genetic Risk,chronic Lung Disease,Balanced Diet,Obesity,Smoking,Passive Smoker,Chest Pain,Coughing of Blood,Fatigue,Weight Loss,Shortness of Breath,Wheezing,Swallowing Difficulty,Clubbing of Finger Nails,Frequent Cold,Dry Cough,Snoring,Level
130,P215,14,1,2,4,5,6,5,5,4,6,5,4,6,5,5,3,2,1,4,7,2,1,6,Medium
19,P115,14,1,2,4,5,6,5,5,4,6,5,4,6,5,5,3,2,1,4,7,2,1,6,Medium
241,P315,14,1,2,4,5,6,5,5,4,6,5,4,6,5,5,3,2,1,4,7,2,1,6,Medium
35,P13,14,1,2,4,5,6,5,5,4,6,5,4,6,5,5,3,2,1,4,7,2,1,6,Medium
352,P415,14,1,2,4,5,6,5,5,4,6,5,4,6,5,5,3,2,1,4,7,2,1,6,Medium
574,P615,14,1,2,4,5,6,5,5,4,6,5,4,6,5,5,3,2,1,4,7,2,1,6,Medium
685,P715,14,1,2,4,5,6,5,5,4,6,5,4,6,5,5,3,2,1,4,7,2,1,6,Medium
796,P815,14,1,2,4,5,6,5,5,4,6,5,4,6,5,5,3,2,1,4,7,2,1,6,Medium
907,P915,14,1,2,4,5,6,5,5,4,6,5,4,6,5,5,3,2,1,4,7,2,1,6,Medium
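
Rows with the same age appear above in byte order of the whole line (index 130 before 19 before 241), because `sort` falls back to comparing entire lines when keys tie. If the original row order should be kept within each age, the stable flag `-s` can be added. A sketch on made-up rows:

```shell
# Made-up sample: two ages, with indexes chosen so that byte order differs from file order.
demo=$(mktemp -d)
cat > "$demo/sample.csv" <<'EOF'
index,Patient Id,Age
9,P9,30
10,P10,14
2,P2,30
11,P11,14
EOF

{
  head -n 1 "$demo/sample.csv"                             # header stays on top
  tail -n +2 "$demo/sample.csv" | sort -t',' -k3,3n -s     # numeric on column 3, stable
} > "$demo/sorted.csv"
cat "$demo/sorted.csv"
```

With `-s`, the row with index 9 stays ahead of the row with index 2 among the 30-year-olds; without it, the whole-line fallback would put 2 first.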
