SR1: Data Sourcing

Dataset Source
I compiled the dataset used for this analysis from publicly available data on South African informal settlements. It records Water Supply, Sanitation, Electricity, Population, Health Issues, Social Inequality, Economic Opportunities, Unemployment Rate, and Crime Rate for each of South Africa's 9 provinces.

Suitability for Analysis
1. Data Completeness: The dataset covers multiple dimensions relevant to informal settlements, giving a broad view of the conditions and challenges in these areas. Each province's data is complete, which allows a thorough analysis.

2. Relevance: The chosen columns, such as Water Supply and Sanitation, relate directly to essential services in informal settlements. They highlight infrastructure challenges and how well basic needs are met, both of which are central to understanding living conditions in these areas.

3. Credibility of Source: Although I compiled the dataset myself, it is based on reliable, publicly available sources. The compilation method is intended to keep the figures accurate and reliable, making the dataset suitable for analysis and for drawing meaningful insights.

Columns Justification
- Water Supply: This column indicates the availability and quality of water services in informal settlements, a fundamental aspect of living conditions and community health.
- Unemployment Rate: The unemployment rate in these areas sheds light on economic challenges and opportunities for livelihood improvement, which matter for social development and policy planning.







SR2: Pre-Processing
import pandas as pd

df = pd.read_csv('C:/Users/27728/OneDrive/Documents/South African Informal Settlement.csv')

print("Dataset Size:", df.shape)
print("Columns:", df.columns)
print("Data Types:")
print(df.dtypes)
print("Data Slicing and Indexing:")
print(df.head())
missing_data = df.isnull().sum()
print("Missing Data:")
print(missing_data)

# Standardise every numeric column to zero mean and unit standard deviation (z-score)
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
for col in numeric_cols:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

df.to_csv('my_cleaned_dataset.csv', index=False)
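
The loop above rescales each numeric column to a z-score. A minimal sketch of the same arithmetic on a short, made-up list (the values are illustrative assumptions, not dataset figures), using only the standard library. `statistics.stdev` is the sample standard deviation, which matches the default of pandas `Series.std()`:

```python
import statistics

# Hypothetical values for illustration only; not real dataset figures.
values = [75.0, 80.0, 85.0, 95.0, 70.0]

mean = statistics.mean(values)
std = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)

# z-score: subtract the mean, then divide by the standard deviation
z_scores = [(v - mean) / std for v in values]

# After scaling, the mean should be (numerically) zero and the standard deviation one.
print("Mean after scaling:", round(statistics.mean(z_scores), 10))
print("Std after scaling:", round(statistics.stdev(z_scores), 10))
```

Because the scaled values are unitless, columns measured on very different scales (for example population counts and percentages) become directly comparable.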




SR2: Machine Learning/Statistical Data Types Description

Water Supply (float): Represents the availability of water services, a continuous numerical variable.
Sanitation (float): Indicates the quality of sanitation facilities, another continuous numerical variable.
Electricity (float): Measures the access to electricity, also a continuous numerical variable.
Population (int): Represents the population size in the informal settlements, an integer variable.
Health Issues (bool): Indicates the presence or absence of health issues, a binary categorical variable (True or False).
Social Inequality (float): Measures the level of social inequality, a continuous numerical variable.
Economic Opportunities (string): Describes the economic opportunities available, a categorical variable (e.g., Low, Medium, High).
Unemployment Rate (float): Represents the percentage of unemployed individuals, a continuous numerical variable.
Crime Rate (float): Measures the level of crime in the area, another continuous numerical variable.
Province (string): Identifies which of South Africa's 9 provinces the row describes, a nominal categorical variable.
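
The type descriptions above can be made explicit in pandas. The sketch below uses a hypothetical three-row frame (the column names and values are assumptions for illustration, not the real file) and declares Economic Opportunities as an ordered categorical so that Low < Medium < High is preserved:

```python
import pandas as pd

# Hypothetical three-row frame; column names and values are illustrative assumptions.
sample = pd.DataFrame({
    "Water Supply": [75.0, 80.5, 92.0],
    "Population": [1200, 3400, 560],
    "Health Issues": [True, False, True],
    "Economic Opportunities": ["Low", "High", "Medium"],
})

# Declare the ordinal variable with an explicit order so Low < Medium < High.
levels = pd.CategoricalDtype(categories=["Low", "Medium", "High"], ordered=True)
sample["Economic Opportunities"] = sample["Economic Opportunities"].astype(levels)

print(sample.dtypes)

# Integer codes follow the declared order (Low=0, Medium=1, High=2).
codes = sample["Economic Opportunities"].cat.codes.tolist()
print(codes)
```

Storing the order in the dtype means sorting and comparisons on this column respect the Low/Medium/High ranking instead of alphabetical order.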





SR3: Exploratory Data Analysis (EDA)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('C:/Users/27728/OneDrive/Documents/South African Informal Settlement.csv')


print("Dataset Size:", df.shape)
print("Columns:", df.columns)
print("Data Types:")
print(df.dtypes)
print("Data Slicing and Indexing:")
print(df.head())
missing_data = df.isnull().sum()
print("Missing Data:")
print(missing_data)


# Summary Statistics
summary_stats = df.describe()
print("Summary Statistics:")
print(summary_stats)

# Basic Infographic Charts:
# 1. Bar Chart
plt.figure(figsize=(10, 6))
sns.barplot(x='PROVINCE', y='POPULATION', data=df)
plt.title('Population by Province')
plt.xlabel('Province')
plt.ylabel('Population')
plt.xticks(rotation=45)
plt.show()

# 2. Pie Chart
plt.figure(figsize=(8, 8))
df['WATER-SUPPLY'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title('Water Supply Distribution')
plt.ylabel('')
plt.show()

# 3. Box Plot
plt.figure(figsize=(8, 6))
sns.boxplot(x='Crime Rate', data=df)
plt.title('Crime Rate Distribution')
plt.xlabel('Crime Rate')
plt.show()

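# Advanced Charts:
# 1. Heatmap
plt.figure(figsize=(10, 8))
# numeric_only=True skips text columns such as the province name, which cannot be correlated
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')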
plt.title('Correlation Matrix')
plt.show()

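# 2. Scatter Plot
plt.figure(figsize=(8, 6))
# Column names must match the CSV header exactly; these follow the spelling used in the bar and pie charts
sns.scatterplot(x='POPULATION', y='WATER-SUPPLY', data=df)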
plt.title('Population vs. Water Supply')
plt.xlabel('Population')
plt.ylabel('Water Supply')
plt.show()






SR 4.1: Algorithm Implementation: Sort algorithm 

def selection_sort(arr):
    n = len(arr)
    for i in range(n - 1):
        min_index = i
        for j in range(i + 1, n):
            if arr[j] < arr[min_index]:
                min_index = j
        arr[i], arr[min_index] = arr[min_index], arr[i]
    return arr


waterSup = [75, 80, 85, 95, 70, 98, 88, 78, 82]
# selection_sort sorts the list in place and returns that same list
sortedWaterSupp_ = selection_sort(waterSup)
print("Sorted water supply (Ascending Order):", sortedWaterSupp_)
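
Since the function above sorts its argument in place, the original order is lost after the call. A minimal sketch of one way to keep it, assuming the same selection sort and the same sample values: sort a copy, then cross-check the result against Python's built-in `sorted()`. The names `water_supply` and `sorted_copy` are illustrative.

```python
def selection_sort(arr):
    """Sort arr in place in ascending order and return it."""
    n = len(arr)
    for i in range(n - 1):
        min_index = i
        for j in range(i + 1, n):
            if arr[j] < arr[min_index]:
                min_index = j
        arr[i], arr[min_index] = arr[min_index], arr[i]
    return arr


water_supply = [75, 80, 85, 95, 70, 98, 88, 78, 82]

# Sort a copy so the original order is preserved for later use.
sorted_copy = selection_sort(list(water_supply))

print("Original:", water_supply)
print("Sorted:  ", sorted_copy)
# Cross-check against Python's built-in sort.
print("Matches sorted():", sorted_copy == sorted(water_supply))
```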





SR4.2: Algorithm Implementation: Search algorithm

Linear Search
Linear search, also known as sequential search, is the simplest search algorithm. It works by iterating through each element of a list until the desired element is found or the list ends. The time complexity of linear search is \(O(n)\), where \(n\) is the number of elements in the list. This means that the search time increases linearly with the number of elements.

Binary Search
Binary search is a more efficient algorithm that works on sorted lists. It repeatedly divides the search interval in half, comparing the target value to the middle element of the list. If the target value is smaller, it narrows the search to the left half; if larger, to the right half. The time complexity of binary search is \(O(\log n)\), where \(n\) is the number of elements in the list. This logarithmic time complexity means that binary search is significantly faster than linear search for large datasets, provided the list is sorted.

Comparison
Binary search is faster than linear search on large sorted lists because each comparison halves the remaining search space, whereas linear search checks elements one by one. The trade-off is that binary search only works if the list is sorted first.
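
To make the comparison concrete, the sketch below counts how many elements each algorithm examines. The helper names `linear_search_steps` and `binary_search_steps` and the one-million-element list are assumptions made for illustration. For a sorted list of this size, binary search needs at most 20 probes, while linear search may need up to one million.

```python
def linear_search_steps(arr, target):
    """Return (index, number of elements examined)."""
    steps = 0
    for index, element in enumerate(arr):
        steps += 1
        if element == target:
            return index, steps
    return -1, steps


def binary_search_steps(arr, target):
    """Return (index, number of middle elements examined); arr must be sorted."""
    left, right = 0, len(arr) - 1
    steps = 0
    while left <= right:
        mid = (left + right) // 2
        steps += 1
        if arr[mid] == target:
            return mid, steps
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1, steps


data = list(range(1_000_000))  # already sorted
target = 999_999               # last element: worst case for linear search

print("Linear (index, steps):", linear_search_steps(data, target))
print("Binary (index, steps):", binary_search_steps(data, target))
```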





SR4.2: Algorithm Implementation: Search algorithm

import pandas as pd


df = pd.read_csv('C:/Users/27728/OneDrive/Documents/South African Informal Settlement.csv')


# Must match the CSV header exactly; spelled as in the pie chart code (no square brackets)
column_to_search = 'WATER-SUPPLY'


def linear_search(arr, target):
    for index, element in enumerate(arr):
        if element == target:
            return index
    return -1


def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1


target_value = 70


linear_search_result = linear_search(df[column_to_search].tolist(), target_value)
print(f"Linear Search Result: {linear_search_result}")

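# Binary search needs sorted input, so the index it returns refers to the sorted list, not the original row order
sorted_column = sorted(df[column_to_search].tolist())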
binary_search_result = binary_search(sorted_column, target_value)
print(f"Binary Search Result: {binary_search_result}")
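
The two results above are not directly comparable: linear search reports a position in the original order, while binary search reports a position in the sorted copy. A minimal sketch of that difference, using a made-up list in place of the CSV column (the values are assumptions) and the standard library's `bisect_left` as an independent binary search:

```python
from bisect import bisect_left

# Hypothetical values standing in for the searched column.
values = [75, 80, 85, 95, 70, 98, 88, 78, 82]
target_value = 70

# Position in the original order (what linear search reports).
original_index = values.index(target_value) if target_value in values else -1

# Position in the sorted order (what binary search reports).
sorted_values = sorted(values)
pos = bisect_left(sorted_values, target_value)
found = pos < len(sorted_values) and sorted_values[pos] == target_value
sorted_index = pos if found else -1

print("Index in original list:", original_index)
print("Index in sorted list:", sorted_index)
```

If the original row is needed after a binary search, one option is to sort (value, original index) pairs instead of the bare values.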






SR 5: Complexity Analysis
Complexity Analysis of the Selection Sort Algorithm (SR 4.1)

Selection Sort Algorithm:

Selection sort is a simple comparison-based sorting algorithm. It divides the input list into two parts: the sublist of items already sorted and the sublist of items remaining to be sorted. Initially, the sorted sublist is empty, and the unsorted sublist is the entire input list. The algorithm proceeds by finding the smallest (or largest, depending on the sorting order) element in the unsorted sublist, swapping it with the leftmost unsorted element (putting it in sorted order), and moving the sublist boundaries one element to the right.

Selection Sort Implementation:

def selection_sort(arr):
    n = len(arr)
    for i in range(n - 1):
        min_index = i
        for j in range(i + 1, n):
            if arr[j] < arr[min_index]:
                min_index = j
        arr[i], arr[min_index] = arr[min_index], arr[i]
    return arr

Time Complexity
The time complexity of selection sort can be analyzed as follows:
- Outer loop: Runs `n-1` times.
- Inner loop: Runs `n-1-i` times on outer iteration `i`, which is about `n/2` times on average.

Therefore, the total number of comparisons is:
\[ \text{Total comparisons} = (n-1) + (n-2) + \cdots + 1 \]
\[ = \frac{n(n-1)}{2} \]
\[ = O(n^2) \]

Best-case, Average-case, and Worst-case time complexity:
- Best-case: \(O(n^2)\)
- Average-case: \(O(n^2)\)
- Worst-case: \(O(n^2)\)

Selection sort always performs exactly \(\frac{n(n-1)}{2}\) comparisons, regardless of the initial order of the elements, which is why the best, average, and worst cases coincide.
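
The comparison count can be checked empirically. The sketch below instruments the same selection sort with a counter (the helper name `selection_sort_comparisons` is an assumption) and runs it on an already sorted and a reversed input of the same length. Both should give \(n(n-1)/2\) comparisons.

```python
def selection_sort_comparisons(arr):
    """Sort a copy of arr and return the number of element comparisons made."""
    arr = list(arr)
    n = len(arr)
    comparisons = 0
    for i in range(n - 1):
        min_index = i
        for j in range(i + 1, n):
            comparisons += 1
            if arr[j] < arr[min_index]:
                min_index = j
        arr[i], arr[min_index] = arr[min_index], arr[i]
    return comparisons


n = 9
already_sorted = list(range(n))
reversed_input = list(range(n, 0, -1))

for name, data in [("sorted", already_sorted), ("reversed", reversed_input)]:
    print(name, selection_sort_comparisons(data), "expected", n * (n - 1) // 2)
```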

Space Complexity
Selection sort is an in-place sorting algorithm, meaning it requires a constant amount of additional memory space.
- Space complexity: \(O(1)\)

Selection sort uses a fixed amount of extra space (a few integer variables for indexing), and no additional storage is needed for its operations.

Efficiency
- Time Efficiency: Selection sort is inefficient for large lists because of its quadratic time complexity. It performs poorly compared to more advanced algorithms like Quick Sort, Merge Sort, and Heap Sort, which have better average-case time complexity.
- Space Efficiency: It is very space-efficient because it requires only \(O(1)\) additional space.

Summary
Selection sort is simple to implement and understand but is not suitable for large datasets due to its \(O(n^2)\) time complexity. 
It is best used for small datasets or when memory space is highly constrained.





