#### Lab 1: Data analysis with numpy, review Python

In [None]:
# Name: Aryan Singhal

In this lab you will investigate the student enrollment numbers for California Community Colleges.
<br>
<br>Note:
<br>- <u>Do not use pandas</u> for this lab
<br>- Take advantage of numpy's functions instead of writing loops to access data
<br>If your code doesn't contain any loop to access data in the numpy arrays, you earn 1pt EC. Loops just to print data is okay.

There are 3 input files for this lab ([source](https://datamart.cccco.edu/Students/Enrollment_Status.aspx)):<br>
- `colleges.txt`: contains a list of all CA community college names, one name per line.<br>
- `students.csv`: contains a table of student enrollment numbers. Each row is for one community college, and each row contains 13 columns for enrollment numbers from Fall 2019 to Fall of 2023.<br>
- `semesters.csv`: contains the semester name and year for the 13 columns of students.csv.

In [None]:
# import modules
import numpy as np

1. Read in data from the input files.

From `colleges.txt` file, read and store all the college names.<br> 
Then __print the number of colleges__, with a text explanation.<br>

From `students.csv` file, read and store all the enrollment data _as integers_.<br>
Then __print the number of rows and columns of the student table__, with a text explanation.<br>

From `semesters.csv` file, read and store all the column headers.<br>
Then __print the number of semesters__, with a text explanation.<br>

Sample print output:<br>
`Number of colleges: 116`<br>
`(Rows,columns) of student enrollment data: (116, 13)`<br>
`Number of semesters: 13`

In [None]:
with open('colleges.txt', 'r') as file:
    colleges = file.read().splitlines()

print("Number of colleges:", len(colleges))

In [None]:
students = np.genfromtxt('students.csv', delimiter=',', skip_header=1, dtype=int)
print("(Rows, columns) of student enrollment data:", students.shape)

In [None]:
semesters = np.genfromtxt('semesters.csv', delimiter=',', dtype=str)
if semesters.ndim == 0:
    semesters = np.array([semesters])
print("Number of semesters:", semesters.size)

2. __Print all the semesters and years__ in the array of semesters.<br>
The print out should be 3 semesters of an academic year per line, with comma in between.

Sample first 3 lines of output:<br>
`Semesters:`<br>
`Fall 2019, Spring 2020, Summer 2020`<br>
`Fall 2020, Spring 2021, Summer 2021`

In [None]:
print("Semesters:")
for i in range(0, len(semesters), 3):
    semester_group = semesters[i:i+3]
    print(", ".join(semester_group))

3. __Print the total number of students across all colleges for each Fall semester__<br>
Sample first 3 lines of output:<br>
`Total students for each Fall:`<br>
`Fall 2019: 1,638,694`<br>
`Fall 2020: 1,454,450`<br>

- Use f-string formatting:  `f'{number:,d}'`  to print integers with comma in the thousands place.<br>
- The semester name and year should come from the semester array, don't hard code them.

In [None]:
print("Total students for each Fall:")
fall_indices = [i for i, s in enumerate(semesters) if s.startswith("Fall")]
for index in fall_indices:
    total_students = students[:, index].sum()
    print(f"{semesters[index]}: {total_students:,d}")

4. __Find the average number of students per semester__ for each college and store in a numpy array.<br>
Then __print the college name and average number of students per semester for the smallest college__<br>
and __the college name and average number of students per semester for the largest college__.

For each output, print a text explanation such as: 'CollegeA is the smallest with 100 students per semester'

In [None]:
average_students_per_semester = np.mean(students, axis=1)

index_smallest = np.argmin(average_students_per_semester)
index_largest = np.argmax(average_students_per_semester)

smallest_college = colleges[index_smallest]
largest_college = colleges[index_largest]

average_smallest = average_students_per_semester[index_smallest]
average_largest = average_students_per_semester[index_largest]

print(f'{smallest_college} is the smallest with {average_smallest:,.0f} students per semester')
print(f'{largest_college} is the largest with {average_largest:,.0f} students per semester')


5. Using the array of average number of students per semester from step 4, __find the average number of students per semester for De Anza College__.
- De Anza is stored as 'Deanza' in the array of college names.
- Use numpy to find 'Deanza' in your arrays, don't hard code the index value for De Anza.
- Print an explanation along with the college name and average number of students, such as:<br>
   'Deanza has 1000 students per semester'

6. Using the array of average number of students per semester from step 4, and using the fact that each academic year has 3 semesters (Fall, Spring, Summer), __print the average number of students per academic year across all colleges__.<br>
- The output should be one number, which is the total number of students per year from all colleges.
- Print an explanation along with the number.
- Since the number will be large, print comma at the thousands place.

In [None]:
total_students_per_year = np.sum(average_students_per_semester) * 3

print(f'The average number of students per academic year across all colleges is {total_students_per_year:,.0f}')


7a. Using the array of average number of students per semester from step 4,<br>
__print the number of colleges that are in the 25th percentile__ of the average number of students,<br>
and __print the number of colleges that are in the 75th percentile__ of the average number of students.<br>
Print an explanation along with each number.                                                           

In [15]:
percentile_25 = np.percentile(average_students_per_semester, 25)
percentile_75 = np.percentile(average_students_per_semester, 75)

colleges_at_25th_percentile = np.sum(average_students_per_semester <= percentile_25)

colleges_at_75th_percentile = np.sum(average_students_per_semester <= percentile_75)

print(f'Number of colleges in the 25th percentile: {colleges_at_25th_percentile}')
print(f'Number of colleges in the 75th percentile: {colleges_at_75th_percentile}')

Number of colleges in the 25th percentile: 29
Number of colleges in the 75th percentile: 86


7b. Are the 2 numbers in the output of step 7a the same?<br>
__Create a Raw NBConvert cell to explain why or why not__.

__7b. Answer:__
No, the two numbers in the output of step 7a are not the same. The number of colleges in the 25th percentile is 29, while the number of colleges in the 75th percentile is 86.

The reason for this difference is because of how the percentiles are calculated and what they represent. The 25th percentile means that 25% of the data points fall below that value, while the 75th percentile means that 75% of the data points fall below that value.

In this case, it indicates that there are 29 colleges where the average number of students per semester is at or below the 25th percentile of all colleges, and there are 86 colleges where the average number of students per semester is at or below the 75th percentile of all colleges.

These numbers can differ significantly because the distribution of average students per semester across colleges is not uniform. There may be a larger concentration of colleges with lower average student numbers, leading to fewer colleges in the higher percentiles.

8. __Find the percent change in enrollment between Fall 2019 and Fall 2023__.<br>
The percent change = (Fall 2023 enrollment - Fall 2019 enrollment) / Fall 2019 enrollment<br>
then multiply by 100 to convert to percentage.<br>
Then __print the college and percent change for the college with the largest change__,<br>
and __print the college and percent change for the college with the smallest change__.<br>
- The percentage should be rounded to an integer
- Print an explanation along with the number

9. Using the percent change in enrollment between Fall 2019 and Fall 2023 from step 8,<br>
__print the number of colleges where the percent change is less than +/-2%__.